IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

Confirmation No.: 6968
Art Unit: 1642 
Examiner: Harris, A. M. 
Atty. Docket: 0609.4560002/KRM/DJN 



Declaration Under 37 C.F.R. § 1.132 of Kenneth D. Bloch, M.D.

Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450

Sir:

I, the undersigned, Kenneth D. Bloch, M.D., residing at 80 Park Street, Apt 32, 
Brookline, Massachusetts, 02446, declare and state as follows: 

1. I am currently employed by Massachusetts General Hospital in the 
Cardiovascular Research Center where I conduct and supervise research in the field of 
molecular cardiology. I worked with Dr. En Li, co-inventor of the captioned application, in 
the Cardiovascular Research Center for 10 years until 2003, when Dr. Li moved to Novartis.
I am also an Associate Professor of Anesthesiology and Medicine at Harvard Medical 
School and have extensive experience in molecular biology and DNA cloning and 
sequencing.

2. A current curriculum vitae is appended hereto as Exhibit A.

3. I have been informed that the human DNMT3A cDNA clone, represented in the
captioned patent application as SEQ ID NO:3, was deposited with the ATCC on July 10, 
1998, and assigned ATCC Deposit No. 98809. I have also been informed that the deposit 
date of July 10, 1998 was prior to the filing date of the second provisional application, App. 
No. 60/093,993, filed July 24, 1998, the benefit of which is claimed. Finally, I have been 
informed that the '993 application includes the sequence information and references the
deposit of the sequenced material on page 16, lines 1-2, of the specification. 

4. In November 2004, Applicants had samples withdrawn of the human 
DNMT3A cDNA clone contained within ATCC Deposit No. 98809. At the Applicants'
request, I have sequenced the nucleotides that span the coding region of the deposited human
DNMT3A cDNA clone contained in ATCC Deposit No. 98809. A nucleotide alignment that
spans the coding regions of the sequenced human DNMT3A cDNA clone contained in
ATCC Deposit No. 98809 and currently amended SEQ ID NO:3 is shown in Fig. 1 of
Exhibit B.

5. The amendment to the sequence listing, which was filed on July 23, 2001, 
corrected six nucleotide errors in the coding sequence of SEQ ID NO:3 (see bolded 
nucleotides at nucleotide positions 940, 1476, 1479, 1570, 2024 and 2119 of amended SEQ
ID NO:3 in Fig. 1 of Exhibit B). The amendment also deleted original nucleotides 1-123 of
SEQ ID NO:3, which do not include any DNMT3A coding sequence.

6. The deposited clone recited in ¶¶ 4 and 5, above (i.e., ATCC Deposit No.
98809) is the same as the deposited clone recited in the above-captioned application. The
six nucleotides in the coding region of SEQ ID NO:3 that were corrected by the amendment 
of July 23, 2001 correspond to the sequence contained in ATCC Deposit No. 98809. It is 
well known that sequencing errors are a common problem in molecular biology. Peter
Richterich, Estimation of Errors in 'Raw' DNA Sequences: A Validation Study, 8 Genome
Research 251-59 (1998) (Exhibit C). I believe that one skilled in the art would have
sequenced the deposited material and recognized the sequencing errors. 

7. My sequencing of ATCC Deposit No. 98809 also revealed that nucleotides 
539-584 within the coding region of amended SEQ ID NO:3 are deleted in the deposited 
cDNA. The deletion causes a frame shift in the coding region of the deposited cDNA and
predicts a truncated protein of 145 amino acids. An amino acid alignment of the predicted 
amino acid sequence encoded by the human DNMT3A cDNA clone contained in ATCC 
Deposit No. 98809 and the predicted amino acid sequence encoded by amended SEQ ID 
NO:3 is shown in Fig. 2 of Exhibit B. The bolded sequence in Fig. 2 corresponds to the
predicted encoded amino acid sequence downstream of the nucleotide deletion in ATCC 
Deposit No. 98809 and represents a point of divergence compared with the predicted amino
acids encoded by currently amended SEQ ID NO:3. 

8. Currently amended SEQ ID NO:3 and SEQ ID NO:3 as originally filed in
U.S. Appl. Nos. 60/090,906 and 60/093,993, to which priority is claimed, do not harbor the 
deletion, and encode a protein having 912 amino acids that is homologous to mouse 
Dnmt3a. 

9. Like DNA sequence errors, it is known that errors in DNA cloning may occur 
in molecular biology. Deletion errors may occur and may be caused by, inter alia,
inadvertent digestion of DNA by restriction endonucleases or exonucleases, or by
recombination events during propagation of the DNA in bacterial hosts. The deletion at
nucleotides 539-584 in SEQ ID NO:3 present in ATCC Deposit No. 98809 is an obvious 
error. I believe that one skilled in the art would have sequenced the deposited material and 
recognized the deletion as an error. My belief is based upon the following: First, the 
deletion found in ATCC Deposit No. 98809 is not present in SEQ ID NO:3 as originally 
filed or as amended. Second, the deletion is not present in the DNA sequence of the closely 
related mouse homolog, SEQ ID NO:1. Third, the deletion causes a frame shift in the
reading frame of SEQ ID NO:3 and predicts a truncated protein product compared with that
encoded by SEQ ID NO:3 as originally filed and as amended. Fourth, the amino acids
encoded by the nucleotide sequence downstream of the deletion bear no similarity to the
amino acids encoded by SEQ ID NO:3 or the mouse homolog of Dnmt3a, encoded by SEQ
ID NO:1. Finally, an examination of the sequence reveals two large open reading frames
(ORF) in the sequence in different frames. See Fig. 3 of Exhibit B. The ORFs correspond
to the amino acid residues of DNMT3A upstream and downstream of the deletion. The
presence of two large ORFs in different frames indicates a possible frameshifting sequence
error or deletion. All of these factors indicate that the deletion present in ATCC Deposit No. 
98809 is an error, and would be recognized as such by a person of ordinary skill in the art. 

10. It is my belief that the combination of ATCC Deposit No. 98809, which 
discloses the six nucleotides in the coding region of SEQ ID NO:3 amended on July 23,
2001, in combination with SEQ ID NO:3 as originally filed in U.S. Appl. Nos. 60/090,906
and 60/093,993, which disclose nucleotides 539-584 of amended SEQ ID NO:3, clearly
conveys to someone skilled in the art the entire nucleotide sequence of amended SEQ ID 
NO:3. 

11. I hereby declare that all statements made herein of my own knowledge are 
true and that all statements made on information and belief are believed to be true; and
further that these statements were made with the knowledge that willful false statements and
the like so made are punishable by fine or imprisonment, or both, under Section 1001 of
Title 18 of the United States Code and that such willful false statements may jeopardize the
validity of the present patent application or any patent issued thereon. 




Respectfully submitted, 



Date: [illegible]



CURRICULUM VITAE 



PART I: General Information 

DATE PREPARED: July 5, 2005 

Name: Kenneth D. Bloch 

Office Address : Cardiovascular Research Center 

Massachusetts General Hospital 
149 13th Street (149-4201)
Charlestown, MA 02129 
(617)724-9540 

Home Address: 80 Park Street, Apartment 32, Brookline, MA 02446 
E-Mail: kdbloch@partners.org FAX: (617)726-5806 

Place of Birth: New York, NY 
Education: 

1978 Sc.B. Brown University (Biomedicine) 

1981 M.D. Brown University 

Postdoctoral Training: 

Internship and Residencies: 

1981-1984 Internal Medicine, Massachusetts General Hospital 

Fellowships: 

1981-1984 Clinical Fellow in Medicine, Harvard Medical School

1984-1987 Research Fellow in Genetics, Harvard Medical School

1987-1989 Clinical and Research Fellow in Medicine, Harvard Medical School

Licensures and Certification: 

1984- Commonwealth of Massachusetts, Registered Physician

1985 Diplomate, American Board of Internal Medicine

1989 Diplomate, Subspecialty Board of Cardiovascular Diseases,

American Board of Internal Medicine 

Academic Appointments: 

1989-1990 Instructor in Medicine, Harvard Medical School

1990-1997 Assistant Professor of Medicine, Harvard Medical School
1997- Associate Professor of Medicine, Harvard Medical School

2005- Associate Professor of Anaesthesia, Harvard Medical School

Hospital or Affiliated Institution Appointments: 

1990-1996 Assistant in Medicine, Massachusetts General Hospital

1996-1999 Assistant Physician, Massachusetts General Hospital
1999-2005 Associate Physician, Massachusetts General Hospital
2005- Physician, Massachusetts General Hospital 

Other Professional Positions and Major Visiting Appointments: none 

Hospital and Health Care Organization Service Appointments: 

1990- Attending Physician, Cardiology and Medical Services,

Massachusetts General Hospital
1991- Practice Member, Cardiac Unit Associates,

Massachusetts General Hospital 

Major Administrative Responsibilities: 

1989-1992 Principal Investigator, Cardiac Unit, Massachusetts General Hospital

1990- Associate Program Director, Fellowship Program in Cardiovascular

Disease, Massachusetts General Hospital

1992- Principal Investigator, Cardiovascular Research Center, Massachusetts

General Hospital

1993-1997 Preceptor, Training Grant to the Cardiovascular Research Center (PI:

Mark C. Fishman), Massachusetts General Hospital

1997-2002 Co-Investigator, Training Grant to the Cardiovascular Research Center

(PI: Mark C. Fishman), Massachusetts General Hospital 
2002-2004 Interim Director, Cardiovascular Research Center, Massachusetts 

General Hospital 

2002- Principal Investigator, Training Grant to the Cardiovascular Research 

Center, Massachusetts General Hospital 

Major Committee Assignments: 

Hospital: 

1991 Member, Selection Committee for the Fellowship Program in 

Cardiovascular Disease, Massachusetts General Hospital 
1992-1996 Member, Subcommittee on Review of Research Proposals, Committee

on Research, Massachusetts General Hospital
1997- Member, Steering Committee coordinating the integration of the

clinical cardiology fellowship programs at Brigham and Women's 

Hospital and Massachusetts General Hospital 






Regional:

1991-1994 Member, Research Peer Review Committee, American Heart
Association, Massachusetts Affiliate, Inc.

2002 Member, Northeast Peer Review Study Group for Lipids, Thrombosis
& Vascular Wall Biology, American Heart Association

National:

1992-1995 Member, Molecular Signaling Research Study Committee, American
Heart Association, National Center

1996-1997 Member, Lung Biology and Pathology Study Section, National Heart
Lung and Blood Institute

2000 Member-At-Large, Executive Committee of the Council on
Cardiopulmonary and Critical Care, American Heart Association
National Center

2000-2002 Member, Program Committee, Council on Cardiopulmonary and
Critical Care, American Heart Association, National Center

2002- Chairman, Program Committee, Council on Cardiopulmonary and
Critical Care, American Heart Association, National Center

2002- Member, Committee on Scientific Sessions Program, American Heart
Association

2005 Member, Respiratory Integrative Biology and Translational Research
Study Section, National Heart Lung and Blood Institute

2005- Member, Research Committee, American Heart Association

2005- Member, Future of Scientific Sessions Task Force, American Heart
Association



Professional Societies:

1997- American Heart Association, Council on Basic Cardiovascular Science

2000 American Heart Association, Council on Cardiopulmonary and Critical
Care

2001 American Society for Clinical Investigation



Editorial Boards: 



1989- Ad hoc reviewer:

American Journal of Pathology 

American Journal of Physiology 

American Journal of Respiratory Cell and Molecular Biology 

Anesthesiology 

Circulation 

Circulation Research 

Journal of Applied Physiology 

Journal of Biological Chemistry 

Journal of Clinical Investigation 

New England Journal of Medicine 

Nature Medicine 
Proceedings of the National Academy of Sciences

Awards and Honors:

1978 Magna cum laude, Brown University
1981 Patricia McCormick Memorial Prize, Brown University
1984-1986 Research Fellowship, Leukemia Society of America
1986-1987 Postdoctoral Fellowship, Pfizer Pharmaceuticals
1996-1998 Grant-in-Aid, American Heart Association, National Center (declined)
1996 Finalist, Circulation Council Cardiovascular Research Competition
2000 Second Prize for the basic science research abstracts
from the Massachusetts Thoracic Society
2001 Charter Fellow of the Council on Basic Cardiovascular Sciences,
American Heart Association
2002 Inaugural Fellow of the Basic Sciences Council, American Heart
Association
2003 Inaugural Fellow of the Council on Cardiopulmonary, Perioperative,
and Critical Care, American Heart Association

Part II: Research, Teaching, and Clinical Contributions

A. Narrative Report

Dr. Bloch's research has focused on cardiovascular biology and the molecular
mechanisms regulating vascular tone and ventricular remodeling. Dr. Bloch established
his own laboratory at MGH in 1991 and became a principal investigator in the
Cardiovascular Research Center in 1992.

When he established his laboratory, it appeared that there was a single constitutively-
expressed nitric oxide (NO) synthase (NOS) in brain and endothelium. Dr. Bloch's group
cloned the NOS isoform responsible for endothelial NO production, NOS3. Dr. Bloch's
group discovered that NOS3-deficient mice are more susceptible to hypoxia-induced
pulmonary vascular remodeling. Enhanced pulmonary NO (either by inhalation of NO
gas or by aerosol delivery of an adenovirus specifying NOS) attenuates pulmonary
vasoconstriction and prevents this pulmonary vascular remodeling. Moreover, they found
that NO inhalation could prevent pulmonary vascular remodeling even before the
development of pulmonary vasoconstriction. Taken together, these studies have
important implications for the treatment of children with congenital heart disease, in
whom pulmonary vascular remodeling precedes the development of pulmonary
hypertension (which is often fatal). Dr. Bloch and his colleagues were the first to show
that NO inhalation also has systemic vascular effects, including attenuation of vascular
neointima formation after balloon injury and prevention of re-thrombosis after coronary
artery thrombolysis (the latter is markedly potentiated by inhibitors of cGMP-
metabolizing phosphodiesterases). These innovative studies have direct clinical
implications for the care of patients with acute coronary syndromes.

Dr. Bloch has brought into focus the fact that regulation of NO responsiveness may be as
important as regulation of NO production. Dr. Bloch found that prolonged exposure of
vascular smooth muscle cells (VSMC) to NO or pro-inflammatory cytokines decreases 
function of soluble guanylate cyclase (sGC; the enzyme responsible for cGMP synthesis 
in response to NO), desensitizing the cells to NO. Dr. Bloch also showed that cGMP- 
dependent protein kinase, an enzyme responsible for vasodilation, also has a critical role 
in determining the sensitivity of these cells to the antiproliferative and proapoptotic 
effects of NO and cGMP. 

Dr. Bloch's group has contributed importantly to a second research area: the structure and
function of the "nuclear body", a nuclear structure that appears to have a critical role in 
oncogenesis, gene transcription, and the cellular response to viral infection. They have 
cloned two new nuclear body components, both of which appear to be transcription 
factors and one of which is a co-activator of nuclear hormone receptors. 

Dr. Bloch's research group has elucidated an important role for NO synthesized by 
endothelial nitric oxide synthase (NOS3) in the left ventricular remodeling induced by a 
variety of hemodynamic challenges. In addition, they have explored the mechanisms
responsible for the impairment of hypoxic pulmonary vasoconstriction that accompanies
pulmonary injury associated with endotoxemia and volutrauma.

Most recently, Dr. Bloch's group has begun a new line of research directed at 
understanding how mutations in the gene encoding bone morphogenetic protein receptor 
type 2 (BMPR2) cause primary pulmonary hypertension. They have observed that mice 
carrying one copy of a mutant BMPR2 gene have mild pulmonary hypertension 
associated with abnormalities of pulmonary vascular structure. 

Dr. Bloch has also made important contributions in the clinical and research training of 
cardiology fellows. He is a primary research mentor for the MGH cardiology fellows 
guiding them to, and supporting them in, the best research training opportunities HMS 
has to offer. In the past five years, Dr. Bloch has played a pivotal role in the integration 
of the cardiology fellowship programs at the Brigham and Women's Hospital and MGH 
providing fellows with exposure to outstanding clinical experiences at both institutions. 
Since 2002, Dr. Bloch has served as principal investigator of the T32 training grant 
awarded to the Cardiovascular Research Center. In this role, Dr. Bloch supervises the 
mentoring and career development of 10 post-doctoral cardiovascular scientists each year. 
From 2002 through 2004, Dr. Bloch served as the Interim Director of the CVRC fostering 
the creativity and productivity of ten faculty members and more than 40 post-doctoral 
research fellows. 

B. Funding Information (Research): 



Past:

1991-1992 Research Grant, Dr. Louis Skarow Memorial Fund (PI: K. Bloch)
"Gene expression in a model of pulmonary hypertension."
1991-1994 Grant-in-Aid, American Heart Association, National Center
(PI: K. Bloch) "Pulmonary expression of endothelin genes."
1991-1996 NHLBI/R29 (PI: K. Bloch)

"Biosynthesis of the endothelin family of vasoactive peptides." 

1995-1996 Sponsored Research Agreement through the Cardiovascular

Research Center from Bristol-Myers Squibb (PI: K. Bloch) 
"Isolation and characterization of novel vascular nitric oxide 
synthases." 

1996-1997 NHLBI/R01 (Co-PI: K. Bloch)

"The pulmonary response to inhaled particulates." 
1996-2000 NHLBI/R01 (PI: K. Bloch) 

"Nitric oxide/cGMP signal transduction in pulmonary injury." 
1996-2001 Established Investigator, American Heart Association, National Center 

(PI: K. Bloch) "Regulation of a nitric oxide receptor component, the β1

subunit of soluble guanylate cyclase."
1998-2002 NHLBI/T32 (Co-PI: K. Bloch; PI: M.C. Fishman)

"Cell and molecular training for cardiovascular biology"
1998-2003 NHLBI/R01 (PI: K. Bloch) 

"Nitric oxide/cGMP signal transduction in vascular injury." 

2001-2002 Pfizer Pharmaceuticals (Co-PI: K. Bloch; PI: M.J. Semigran)

"Evaluation of the effects of sildenafil with and without inhaled nitric 
oxide (NO) on platelet-mediated thrombosis and cardiac function in a 
canine model of cyclic coronary artery occlusion." 

Current: 

1996- NHLBI/R01 (Co-PI: K. Bloch; PI: W. Zapol) 

"Studies of inhaled nitric oxide." 

2002- INO Therapeutics, Inc. (PI: K. Bloch)

"Laboratory-based initiatives for the further development of the 
therapeutic potential of inhaled nitric oxide." 

2002- NHLBI/T32 (PI: K. Bloch)

"Cell and molecular training for cardiovascular biology"

2003- NHLBI/R01 (PI: K. Bloch)

"Nitric oxide synthase 3 and left ventricular remodeling" 
2003- NHLBI/R01 (PI: K. Bloch) 

"BMPR2 in the pathogenesis of pulmonary hypertension" 

C. Report of Current Research Activities (Bench and Clinical Research): 

Project 1: Studies of inhaled nitric oxide. (Co-PI: K. Bloch)
Project 2: Evaluation of the systemic effects of breathing nitric oxide (PI: K. Bloch) 
Project 3: Role of nitric oxide in left ventricular remodeling (PI: K. Bloch) 
Project 4: Role of BMPR2 in the pathogenesis of pulmonary hypertension (PI: K. Bloch) 
Project 5: Evaluation of nitric oxide inhalation to treat cardiogenic shock complicating 
right ventricular infarction. (Co-PI: K. Bloch) 
D. Report of Teaching

1. Local Contributions
a. medical school

1977 Brown University, Biomed 110, Biophysics

course director: Babette Stewart 
Teaching Assistant 
50 undergraduates (approx.) 

3-5 hours preparation and contact time/week (approx.) 
semester course 

1978-1981 Brown University, Biomed 6, Introduction to Physiology

course director: Peter Stewart 
Teaching Assistant 
50 undergraduates (approx.) 

3-5 hours preparation and contact time/week (approx.) 
semester course 

1986 Harvard University, Genetics 700.0, Fundamentals of Genetics

course director: Philip Leder 
Teaching Assistant 
15 medical students (approx.) 
3 hours preparation and contact time/week (approx.) 
semester course 

1990-1992 Harvard Medical School, Introduction to Clinical Medicine

Clinical Mentor 

2-3 medical students, 3 sessions/week, 2 hours/session, 3 weeks/year 
total: 9 hours preparation time, 18 hours contact time 

1993-1996 Harvard Medical School, Patient-Doctor II Course 

course directors: Katherine Treadway and Diane Fingold 
Preceptor 

50 medical students (approx.) 

3 sessions, 2 hours/session, 1/year 

total: 9 hours preparation and contact time 

b. graduate medical course: none

c. local invited teaching presentations 

1993-1994 "Ethical Conduct of Research" Course, Massachusetts General 

Hospital, Preceptor, 5-10 postdoctoral fellows, 2 sessions/year, 
2 hours/session, prep time: 1 hour/session. 

1994 Cardiology Grand Rounds, Massachusetts General Hospital, Lecturer;
50 attendees: medical students, residents, clinical and research fellows,
faculty; 4 hrs prep and 1 hr contact time. 

1996 Anesthesiology Grand Rounds, Massachusetts General Hospital;

Lecturer; 50 attendees: medical students, residents, clinical and 
research fellows, faculty; 4 hrs prep and 1 hr contact time. 

1995 Wellman Laboratories of Photomedicine Symposium, Massachusetts

General Hospital, Session Chair; 50 attendees: medical students, 
residents, clinical and research fellows, faculty; 1 hr prep and contact time. 

1997 Cardiology Grand Rounds, Massachusetts General Hospital, Lecturer;

50 attendees: medical students, residents, clinical and research fellows,
faculty; 4 hrs prep and 1 hr contact time. 

1998 Clinical Fellows' Core Curriculum Lecture Series, Massachusetts

General Hospital, Lecturer; 5 attendees: medical students and fellows; 
4 hrs prep time and 1 hr contact time. 

1999 Pulmonary and Critical Care Unit Research Conference, Massachusetts

General Hospital, Lecturer; 50 attendees: medical students, residents, 
clinical and research fellows, faculty; 4 hrs prep and 1 hr contact time. 

2000 Center for the Prevention of Cardiovascular Disease, Department of 

Nutrition, Harvard School of Public Health, Lecturer; 50 attendees: 
medical students, residents, clinical and research fellows, faculty; 

4 hrs prep and 1 hr contact time. 

2001 West Roxbury Veterans Administration Hospital, Cardiology Grand

Rounds; 30 attendees: medical students, residents, clinical and 
research fellows, faculty; 4 hrs prep and 2 hr contact time. 

2003 Brigham and Women's Hospital, Vascular Research Seminar; 30 

attendees: medical students, residents, clinical and research fellows, 
faculty; 4 hrs prep and 2 hr contact time. 

2004 Massachusetts General Hospital, Cardiology Grand Rounds; 50 

attendees: nurses, medical students, residents, clinical and research 
fellows, faculty; 5 hrs prep and 1 hr contact time. 

2005 Massachusetts General Hospital, Critical Care Research Retreat; 50 

attendees: clinical and research fellows, faculty; 10 hours prep time, 4
hours contact time, 10 minute lecture 

d. continuing medical education courses: none

e. advisory and supervisory responsibilities

1989- Attending Physician, Private and Ward Medical Services,

Massachusetts General Hospital (variable hours/year)

1989-1992 Principal Investigator, Cardiovascular Research Center, Massachusetts

General Hospital, scientific supervisor for Research Fellows, including 
one Associate Professor and one Assistant Professor of Anesthesia 
(1,000 hours/year). 

1990- Attending Physician, Cardiology Consult Service, Massachusetts

General Hospital (140 hours/year).

1991- Research Advisor, Cardiology Fellowship Program, Massachusetts
General Hospital, 25-35 Fellows per year in clinical and research
training (200 hours/year).

1992-2002 Attending Physician, Coronary Care Unit, Massachusetts General

Hospital (variable hours/year).

1992- Principal Investigator, Cardiovascular Research Center, Massachusetts

General Hospital, scientific supervisor for Research Fellows, 
including two Assistant Professors of Anesthesia (1,000 hours/year). 

2002- Principal Investigator, NIH-sponsored program (T32) to train 10

post-doctoral cardiovascular scientists each year (200 hours/year)

f. teaching leadership role 

1990-1999 Cardiac Unit Research Seminar Series, Massachusetts General

Hospital; Organizer; A weekly series of presentations by staff and
senior fellows designed to highlight research in the Cardiac Unit.

1995-1996 Cardiac Unit Society of Fellows, Massachusetts General Hospital;

Organizer; A quarterly series of symposia presented by research
fellows at the home of the Chief of the MGH Cardiac Unit.

1997 Society of Cardiology Fellows, Massachusetts General Hospital and

Brigham and Women's Hospital; Co-Organizer; A quarterly series of 
symposia designed to foster scientific communication and 
collaboration between MGH and BWH and to highlight research 
opportunities for fellows in cardiology training. 

2002- CVRC Seminar Series; Organizer; A weekly seminar series presented

by visiting scientists in the MGH Cardiovascular Research Center 

g. names of advisees and trainees/current positions 

1989-1990 Charles C. Hong, MD/PhD, Clinical and Research Fellow,

Cardiology Division, Massachusetts General Hospital

1990-1991 Robert Schott, MD, MPH, private practice

1991-1993 Akito Shimouchi, MD, Assistant Professor of Medicine, National

Cardiovascular Center Research Institute, Osaka, Japan

1990-1991 Stefan P. Janssens, MD, PhD, Associate Professor of Medicine,

Cardiac Unit and Laboratory for Molecular and Vascular Biology, 
University Hospital Gasthuisberg, Leuven, Belgium 

1992-1994, 2001- Noriko Kawai, MD, PhD, Research Fellow in Medicine,

Cardiovascular Research Center, Massachusetts General Hospital

1992-1993 John J. Lepore, MD, Instructor of Medicine, University of

Pennsylvania Medical Center, Department of Medicine, 
Cardiovascular Medicine Division and Molecular Cardiology 
Laboratory 

1993-1994 Johanna Wolfram, MD, University Clinic for Internal Medicine II,

Department of Cardiology, General Hospital, Vienna, Austria
1993-1999 Lucienne Sanchez, MD, Instructor in Pediatrics, Harvard Medical

School/Massachusetts General Hospital
1993-2004 Jesse D. Roberts, MD, Associate Professor of Anesthesiology in

Pediatrics, Harvard Medical School/Massachusetts General Hospital

1994-1995 Jeffrey Thomas, MD, Associate Professor of Neurosurgery, Chief of

Neurovascular and Neurointerventional Surgery, Division of 
Neurosurgery, The University of New Mexico Health Sciences 
Center 

1994-1996 Alexandra Holzmann, MD, Department of Anesthesiology,

University of Heidelberg

1995-2004 Heling Liu, MD, on leave to care for her children

1995-1998 Jean-Daniel Chiche, MD, Professor, Cochin University, Paris,

France

1996-1997 Anita Honkanen, MD, private practice

1996-1998 Masao Takata, MD/PhD, Assistant Professor, Department of 

Anaesthetics and Intensive Care, Imperial College School of 
Medicine, Hammersmith Hospital, London, United Kingdom 

1996-1998 Douglas Wirthlin, MD, Assistant Professor of Surgery, Vascular

Surgery, University of Alabama at Birmingham

1996-1998 Joerg Weimann, MD, Professor of Anesthesia, Department of

Anaesthesiology and Intensive Care Medicine, Charité-Berlin
Medical School, Campus Benjamin Franklin

1998-2002 Galina Filippov, MD, Research Scientist, Omnigene Bioproducts

Inc.

1998, 2000-2001 Zena Quezado, MD, Chief, Department of Anesthesia and Surgical

Services, Warren G. Magnuson Clinical Center, National Institutes 
of Health 

1998-2000 Roman Ullrich, MD, Associate Professor of Anesthesia and

Intensive Care Medicine, Vienna General Hospital, Medical
University of Vienna

1999-2001 Hiroshi Nakajima, MD, Neurosurgery Residency, Tokyo Women's

Medical College, Tokyo, Japan
1999-2001 Pini Orbach, PhD, Project Manager, Drug Development, Perdix

Pharmaceuticals, Inc.
1999- Fumito Ichinose, MD, PhD, Assistant Professor of Anesthesia,

Harvard Medical School/Massachusetts General Hospital

2001-2003 Aimee Limbach, PhD, Post-doctoral Fellow, Center for Human

Molecular Genetics, Munroe-Meyer Institute, University of Nebraska 
Medical Center 

2002-2003 Elisabeth Choe, MD, Resident in Internal Medicine, University of

Texas Southwestern

2002-2004 Cornelius Busch, MD, Resident in Anesthesia, Department of

Anesthesiology, University of Heidelberg
2003 Claire Mayeur, MD, Resident in Anesthesiology, Lille, France

2003- Hideyuki Beppu, MD, PhD, Instructor in Medicine, Cardiovascular

Research Center, Massachusetts General Hospital

2003- Manu Buys, PhD, Research Fellow in Medicine, Cardiovascular

Research Center, Massachusetts General Hospital
2003- Paul Yu, MD, PhD, Clinical and Research Fellow in Medicine,

Cardiovascular Research Center, Massachusetts General Hospital

2004-2005 David Bayne, undergraduate student, Harvard University
2004- Ryuji Hataishi, MD, PhD, Research Fellow in Anesthesia,

Massachusetts General Hospital 

2004- Rajeev Malhotra, MD, Resident in Internal Medicine, Massachusetts

General Hospital

2004- Tomas Neilan, MD, Clinical and Research Fellow in Medicine,

Cardiac Ultrasound Laboratory and Cardiovascular Research Center,
Massachusetts General Hospital

2005- Sarah Blake, MD, Research Fellow in Anesthesia, Massachusetts

General Hospital 

2. Regional, National, or International Contributions

1994 Invited Lecturer, American College of Cardiology, Dallas, TX

1994 Invited Lecturer, Pfizer Pharmaceuticals, Groton, CT

1994 Invited Lecturer, University of Leuven, Belgium

1995 Invited Lecturer, St. Elizabeth's Hospital, Cardiovascular Research 

Seminar, Boston, MA 

1996 Invited Lecturer, Boston Heart Foundation, Boston, MA

1996 Invited Lecturer, Georgia Medical College, Vascular Biology

Division, Atlanta, GA 

1997 Invited Lecturer, Harvard Medical School, Vascular Biology

Seminar, Boston, MA

1997 Invited Lecturer, Boston University, Whittaker Foundation, Boston, MA

1998 Invited Lecturer, Oregon Health Sciences University, Cardiology

Division, Oregon 

1998 Invited Lecturer, Brigham and Women's Hospital, Cardiology

Division, Monday Morning Research Conference, Boston, MA

1998 Invited Lecturer, New York Medical College, Department of

Pharmacology Seminar, New York, NY 

1 999 Invited Lecturer, Tufts University School of Medicine, Dept. of 

Medicine, Boston, MA 

1 999 Invited Lecturer, Tufts University School of Medicine/New England 

Medical Center, Pulmonary and Critical Care Division, Boston, MA 
1999 Invited Lecturer, Millenium Pharmaceuticals, Inc., Cambridge, MA 

1 999 Invited Lecturer, University of Washington, Cardiology Dept 

Seattle, WA 

1 999 Invited Lecturer, University of Alabama at Birmingham, Dept. of 

Pathology, Birmingham, AL 
1 999 Invited Lecturer, 3 rd International Society for Medical Gases 

Meeting, Heidelberg, Germany 
1 999 Invited Lecturer, National Institute of Health, Critical Care 

Medicine, Bethesda, MD 
2004 Invited Lecturer, INO Therapeutics Inc. Scientific Advisory Board 

Chatham, MA 
2002 Invited Lecturer, Medical College of Wisconsin, Milwaukee, 

Wisconsin 

2002 Invited Lecturer, American Heart Association Scientific Sessions, 

Chicago, IL 

2003 Invited Lecturer, Department for Molecular and Biomedical 

Research, Universiteit Gent, Belgium 

2003 Invited Lecturer, Whitaker Cardiovascular Institute, Boston 

University Medical School 

2003 Invited Lecturer, Cardiology Division, University of Alberta, 

Edmonton, Canada 

2004 Invited Lecturer, American Heart Association, Northeast Affiliate, 

Symposium: Launching a Career in Cardiovascular Research 

2004 Invited Lecturer, Cardiology Grand Rounds, Dartmouth-Hitchcock 

Medical Center— "Nitric oxide synthases in ventricular remodeling: 
insights gained from genetically-modified mice." 

2004 Invited Lecturer, Vascular Biology Seminar, Dartmouth-Hitchcock 

Medical Center — "Mechanisms regulating pulmonary vascular 
structure and function— roles of leukotrienes and bone 
morphogenetic proteins." 

2004 Invited Lecturer, Department of Physiology Seminar, Louisiana State 

University-Shreveport— "Nitric oxide synthases in ventricular 
remodeling: insights gained from genetically-modified mice." 

2004 Invited Lecturer, Cardiovascular Cell and Gene Therapy Conference 

II, Cambridge, MA— "Nitric oxide/cGMP signal transduction— 
implications for cardiovascular gene transfer." 

2004 Invited Lecturer, Critical Therapeutics, Inc., Lexington, MA— 

"Mechanisms of pulmonary vascular dysfunction in lung injury: 
insights gained from genetically-modified mice." 



E. Report of Clinical Activities 

Dr. Bloch is a practicing cardiologist who maintains a practice within the Cardiac Unit 
Associates and Cardiology Division at the Massachusetts General Hospital. His practice 
consists of patients with cardiac problems of a moderate to high level of complexity 
referred to a tertiary care center. 
Exhibit B 

FIGURE 1 

Alignment of human DNMT3A from ATCC Deposit No. 98809 (top) and currently 

amended SEQ ID NO:3 (bottom) 1 



gccgcggcaccagggcgcgcagccgggccggcccgaccccaccggccatacggtggagcc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gccgcggcaccagggcgcgcagccgggccggcccgaccccaccggccatacggtggagcc 60 
atcgaagcccccacccacaggctgacagaggcaccgttcaccagagggctcaacaccggg 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

atcgaagcccccacccacaggctgacagaggcaccgttcaccagagggctcaacaccggg 120 
atctatgtttaagttttaactctcgcctccaaagaccacgataattccttccccaaagcc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

atctatgtttaagttttaactctcgcctccaaagaccacgataattccttccccaaagcc 180 
cagcagccccccagccccgcgcagccccagcctgcctcccggcgcccagatgcccgccat 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

cagcagccccccagccccgcgcagccccagcctgcctcccggcgcccagatgcccgccat 240 
gccctccagcggccccggggacaccagcagctctgctgcggagcgggaggaggaccgaaa 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gccctccagcggccccggggacaccagcagctctgctgcggagcgggaggaggaccgaaa 300 
ggacggagaggagcaggaggagccgcgtggcaaggaggagcgccaagagcccagcaccac 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

ggacggagaggagcaggaggagccgcgtggcaaggaggagcgccaagagcccagcaccac 360 
ggcacggaaggtggggcggcctgggaggaagcgcaagcaccccccggtggaaagcggtga 
ggcacggaaggtggggcggcctgggaggaagcgcaagcaccccccggtggaaagcggtga 420 
cacgccaaaggaccctgcggtgatctccaagtccccatccatggcccaggactcaggcgc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

cacgccaaaggaccctgcggtgatctccaagtccccatccatggcccaggactcaggcgc 480 
ctcagagctattacccaatggggacttggagaagcggagtgagccccagccagaggag..

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

ctcagagctattacccaatggggacttggagaagcggagtgagccccagccagaggaggg 540 

............................................agggtgcagctgagac 

                                            ||||||||||||||||

gagccctgctggggggcagaagggcggggccccagcagagggagagggtgcagctgagac 600 
cctgcctgaagcctcaagagcagtggaaaatggctgctgcacccccaaggagggccgagg 
cctgcctgaagcctcaagagcagtggaaaatggctgctgcacccccaaggagggccgagg 660 
agcccctgcagaagcgggcaaagaacagaaggagaccaacatcgaatccatgaaaatgga 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

agcccctgcagaagcgggcaaagaacagaaggagaccaacatcgaatccatgaaaatgga 720 



1 Bolded nucleotides indicate nucleotides that were amended on July 23, 2001. 
gggctcccggggccggctgcggggtggcttgggctgggagtccagcctccgtcagcggcc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gggctcccggggccggctgcggggtggcttgggctgggagtccagcctccgtcagcggcc 780 
catgccgaggctcaccttccaggcgggggacccctactacatcagcaagcgcaagcggga 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

catgccgaggctcaccttccaggcgggggacccctactacatcagcaagcgcaagcggga 840 
cgagtggctggcacgctggaaaagggaggctgagaagaaagccaaggtcattgcaggaat 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

cgagtggctggcacgctggaaaagggaggctgagaagaaagccaaggtcattgcaggaat 900 
gaatgctgtggaagaaaaccaggggcccggggagtctcagaaggtggaggaggccagccc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gaatgctgtggaagaaaaccaggggcccggggagtctcagaaggtggaggaggccagccc 960 
tcctgctgtgcagcagcccactgaccccgcatcccccactgtggctaccacgcctgagcc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

tcctgctgtgcagcagcccactgaccccgcatcccccactgtggctaccacgcctgagcc 1020 
cgtggggtccgatgctggggacaagaatgccaccaaagcaggcgatgacgagccagagta 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

cgtggggtccgatgctggggacaagaatgccaccaaagcaggcgatgacgagccagagta 1080 
cgaggacggccggggctttggcattggggagctggtgtgggggaaactgcggggcttctc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

cgaggacggccggggctttggcattggggagctggtgtgggggaaactgcggggcttctc 1140 
ctggtggccaggccgcattgtgtcttggtggatgacgggccggagccgagcagctgaagg 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

ctggtggccaggccgcattgtgtcttggtggatgacgggccggagccgagcagctgaagg 1200 
cacccgctgggtcatgtggttcggagacggcaaattctcagtggtgtgtgttgagaagct 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

cacccgctgggtcatgtggttcggagacggcaaattctcagtggtgtgtgttgagaagct 1260 
gatgccgctgagctcgttttgcagtgcgttccaccaggccacgtacaacaagcagcccat 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gatgccgctgagctcgttttgcagtgcgttccaccaggccacgtacaacaagcagcccat 1320 
gtaccgcaaagccatctacgaggtcctgcaggtggccagcagccgcgcggggaagctgtt 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gtaccgcaaagccatctacgaggtcctgcaggtggccagcagccgcgcggggaagctgtt 1380 
cccggtgtgccacgacagcgatgagagtgacactgccaaggccgtggaggtgcagaacaa 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

cccggtgtgccacgacagcgatgagagtgacactgccaaggccgtggaggtgcagaacaa 1440 
gcccatgattgaatgggccctggggggcttccagccttctggccctaagggcctggagcc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gcccatgattgaatgggccctggggggcttccagccttctggccctaagggcctggagcc 1500 
accagaagaagagaagaatccctacaaagaagtgtacacggacatgtgggtggaacctga 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

accagaagaagagaagaatccctacaaagaagtgtacacggacatgtgggtggaacctga 1560 
ggcagctgcctacgcaccacctccaccagccaaaaagccccggaagagcacagcggagaa 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

ggcagctgcctacgcaccacctccaccagccaaaaagccccggaagagcacagcggagaa 1620 
gcccaaggtcaaggagattattgatgagcgcacaagagagcggctggtgtacgaggtgcg 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gcccaaggtcaaggagattattgatgagcgcacaagagagcggctggtgtacgaggtgcg 1680 
gcagaagtgccggaacattgaggacatctgcatctcctgtgggagcctcaatgttaccct 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gcagaagtgccggaacattgaggacatctgcatctcctgtgggagcctcaatgttaccct 1740 
ggaacaccccctcttcgttggaggaatgtgccaaaactgcaagaactgctttctggagtg 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

ggaacaccccctcttcgttggaggaatgtgccaaaactgcaagaactgctttctggagtg 1800 
tgcgtaccagtacgacgacgacggctaccagtcctactgcaccatctgctgtgggggccg 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

tgcgtaccagtacgacgacgacggctaccagtcctactgcaccatctgctgtgggggccg 1860 
tgaggtgctcatgtgcggaaacaacaactgctgcaggtgcttttgcgtggagtgtgtgga 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

tgaggtgctcatgtgcggaaacaacaactgctgcaggtgcttttgcgtggagtgtgtgga 1920 
cctcttggtggggccgggggctgcccaggcagccattaaggaagacccctggaactgcta 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

cctcttggtggggccgggggctgcccaggcagccattaaggaagacccctggaactgcta 1980 
catgtgcgggcacaagggtacctacgggctgctgcggcggcgagaggactggccctcccg 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

catgtgcgggcacaagggtacctacgggctgctgcggcggcgagaggactggccctcccg 2040 
gctccagatgttcttcgctaataaccacgaccaggaatttgaccctccaaaggtttaccc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gctccagatgttcttcgctaataaccacgaccaggaatttgaccctccaaaggtttaccc 2100 
acctgtcccagctgagaagaggaagcccatccgggtgctgtctctctttgatggaatcgc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

acctgtcccagctgagaagaggaagcccatccgggtgctgtctctctttgatggaatcgc 2160 
tacagggctcctggtgctgaaggacttgggcattcaggtggaccgctacattgcctcgga 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

tacagggctcctggtgctgaaggacttgggcattcaggtggaccgctacattgcctcgga 2220 
ggtgtgtgaggactccatcacggtgggcatggtgcggcaccaggggaagatcatgtacgt 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

ggtgtgtgaggactccatcacggtgggcatggtgcggcaccaggggaagatcatgtacgt 2280 
cggggacgtccgcagcgtcacacagaagcatatccaggagtggggcccattcgatctggt 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

cggggacgtccgcagcgtcacacagaagcatatccaggagtggggcccattcgatctggt 2340 
gattgggggcagtccctgcaatgacctctccatcgtcaaccctgctcgcaagggcctcta 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gattgggggcagtccctgcaatgacctctccatcgtcaaccctgctcgcaagggcctcta 2400 
cgagggcactggccggctcttctttgagttctaccgcctcctgcatgatgcgcggcccaa 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

cgagggcactggccggctcttctttgagttctaccgcctcctgcatgatgcgcggcccaa 2460 
ggagggagatgatcgccccttcttctggctctttgagaatgtggtggccatgggcgttag 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

ggagggagatgatcgccccttcttctggctctttgagaatgtggtggccatgggcgttag 2520 
tgacaagagggacatctcgcgatttctcgagtccaaccctgtgatgattgatgccaaaga 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

tgacaagagggacatctcgcgatttctcgagtccaaccctgtgatgattgatgccaaaga 2580 
agtgtcagctgcacacagggcccgctacttctggggtaaccttcccggtatgaacaggcc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

agtgtcagctgcacacagggcccgctacttctggggtaaccttcccggtatgaacaggcc 2640 
gttggcatccactgtgaatgataagctggagctgcaggagtgtctggagcatggcaggat 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gttggcatccactgtgaatgataagctggagctgcaggagtgtctggagcatggcaggat 2700 
agccaagttcagcaaagtgaggaccattactacgaggtcaaactccataaagcagggcaa 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

agccaagttcagcaaagtgaggaccattactacgaggtcaaactccataaagcagggcaa 2760 
agaccagcattttcctgtcttcatgaatgagaaagaggacatcttatggtgcactgaaat 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

agaccagcattttcctgtcttcatgaatgagaaagaggacatcttatggtgcactgaaat 2820 
ggaaagggtatttggtttcccagtccactatactgacgtctccaacatgagccgcttggc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

ggaaagggtatttggtttcccagtccactatactgacgtctccaacatgagccgcttggc 2880 
gaggcagagactgctgggccggtcatggagcgtgccagtcatccgccacctcttcgctcc 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

gaggcagagactgctgggccggtcatggagcgtgccagtcatccgccacctcttcgctcc 2940 
gctgaaggagtattttgcgtgtgtgtaagggacatgggggcaaactgaggtagcg 

|||||||||||||||||||||||||||||||||||||||||||||||||||||||

gctgaaggagtattttgcgtgtgtgtaagggacatgggggcaaactgaggtagcg 2995 
FIGURE 2 

Alignment of predicted amino acids encoded by DNMT3A cDNA in ATCC Deposit 
No. 98809 (top) and predicted amino acids encoded by currently amended SEQ ID 

NO:3 (bottom) 2 



MPAMPSSGPGDTSSSAAEREEDRKDGEEQEEPRGKEERQEPSTTARKVGRPGRKR 55 

MPAMPSSGPGDTSSSAAEREEDRKDGEEQEEPRGKEERQEPSTTARKVGRPGRKR 55 

KHPPVESGDTPKDPAVISKSPSMAQDSGASELLPNGDLEKRSEPQPEERVQLRPC 110 

KHPPVESGDTPKDPAVISKSPSMAQDSGASELLPNGDLEKRSEPQPEEGSPAGGQ 110 

LKPQEQWKMAAAPPRRAEEPLQKRAKNRRRPTSNP*KWRAPGAGCGVAWAGSPAS 165 

KGGAPAEGEGAAETLPEASRAVENGCCTPKEGRGAPAEAGKEQKETNIESMKMEG 165 

VSGPCRGSPSRRGTPTTSASASGTSGWHAGKGRLRRKPRSLQE*MLWKKTRGPGS 220 

SRGRLRGGLGWESSLRQRPMPRLTFQAGDPYYISKRKRDEWLARWKREAEKKAKV 220 

LRRWRRPALLLCSSPLTPHPPLWLPRLSPWGPMLGTRMPPKQAMTSQSTRTAGAL 275 

IAGMNAVEENQGPGESQKVEEASPPAVQQPTDPASPTVATTPEPVGSDAGDKNAT 275 

ALGSWCGGNCGASPGGQAALCLGG*RAGAEQLKAPAGSCGSETANSQWCVLRS*C 330 

KAGDDEPEYEDGRGFGIGELVWGKLRGFSWWPGRIVSWWMTGRSRAAEGTRWVMW 330 

R*ARFAVRSTRPRTTSSPCTAKPSTRSCRWPAAARGSCSRCATTAMRVTLPRPWR 385 

FGDGKFSVVCVEKLMPLSSFCSAFHQATYNKQPMYRKAIYEVLQVASSRAGKLFP 385 

CRTSP*LNGPWGASSLLALRAWSHQKKRRIPTKKCTRTCGWNLRQLPTHHLHQPK 440 

VCHDSDESDTAKAVEVQNKPMIEWALGGFQPSGPKGLEPPEEEKNPYKEVYTDMW 440 

SPGRAQRRSPRSRRLLMSAQESGWCTRCGRSAGTLRTSASPVGASMLPWNTPSSL 495 

VEPEAAAYAPPPPAKKPRKSTAEKPKVKEIIDERTRERLVYEVRQKCRNIEDICI 495 

EECAKTARTAFWSVRTSTTTTATSPTAPSAVGAVRCSCAETTTAAGAFAWSVWTS 550 

SCGSLNVTLEHPLFVGGMCQNCKNCFLECAYQYDDDGYQSYCTICCGGREVLMCG 550 

WWGRGLPRQPLRKTPGTATCAGTRVPTGCCGGERTGPPGSRCSSLITTTRNLTLQ 605 

NNNCCRCFCVECVDLLVGPGAAQAAIKEDPWNCYMCGHKGTYGLLRRREDWPSRL 605 

RFTHLSQLRRGSPSGCCLSLMESLQGSWC*RTWAFRWTATLPRRCVRTPSRWAWC 660 

QMFFANNHDQEFDPPKVYPPVPAEKRKPIRVLSLFDGIATGLLVLKDLGIQVDRY 660 

GTRGRSCTSGTSAASHRSISRSGAHSIW*LGAVPAMTSPSSTLLARASTRALAGS 715 

IASEVCEDSITVGMVRHQGKIMYVGDVRSVTQKHIQEWGPFDLVIGGSPCNDLSI 715 

SLSSTASCMMRGPRREMIAPSSGSLRMWWPWALVTRGTSRDFSSPTL**LMPKKC 770 

VNPARKGLYEGTGRLFFEFYRLLHDARPKEGDDRPFFWLFENVVAMGVSDKRDIS 770 

QLHTGPATSGVTFPV*TGRWHPL*MISWSCRSVWSMAG*PSSAK*GPLLRGQTP* 825 

RFLESNPVMIDAKEVSAAHRARYFWGNLPGMNRPLASTVNDKLELQECLEHGRIA 825 



* indicates a predicted stop codon. Bolded amino acids are encoded by nucleotides located downstream 
of the deletion. 
SRAKTSIFLSS*MRKRTSYGALKWKGYLVSQSTILTSPT*AAWRGRDCWAGHGAC 880 

KFSKVRTITTRSNSIKQGKDQHFPVFMNEKEDILWCTEMERVFGFPVHYTDVSNM 880 



QSSATSSLR*RSILRVCKGHGGKLR**AAWRG 912 

SRLARQRLLGRSWSVPVIRHLFAPLKEYFACV 912 
FIGURE 3 

Human DNMT3A from ATCC Deposit No. 98809 with forward reading frames 

(DNMT3A amino acid residues bolded) 

DNA : GCCGCGGCACCAGGGCGCGCAGCCGGGCCGGCCCGACCCCACCGGCCATAC 51 
+ 3: RGTRARSRAGPTPPAIR 

+ 2: PRHQGAQPGRPDPTGHT 

+ 1: AAAPGRAAGPARPHRPY 


DNA : GGTGGAGCCATCGAAGCCCCCACCCACAGGCTGACAGAGGCACCGTTCACC 102 
+ 3: WSHRSPHPQADRGTVHQ 

+ 2: VEPSKPPPTG*QRHRSP 

+ 1: GGAIEAPTHRLTEAPFT 

DNA : AGAGGGCTCAACACCGGGATCTATGTTTAAGTTTTAACTCTCGCCTCCAAA 153 
+ 3: RAQHRDLCLSFNSRLQR 

+ 2: EGSTPGSMFKF*LSPPK 

+ 1: RGLNTGIYV*VLTLASK 

DNA : GACCACGATAATTCCTTCCCCAAAGCCCAGCAGCCCCCCAGCCCCGCGCAG 204 

+3: PR*FLPQSPAAPQPRAA 

+2: TTIIPSPKPSSPPAPRS 

+ 1: DHDNSFPKAQQPPSPAQ 

DNA : CCCCAGCCTGCCTCCCGGCGCCCAGATGCCCGCCATGCCCTCCAGCGGCCC 255 
+ 3: PACLPAPRCPPCPPAAP 

+2: PSLPPGAQMPAMPSSGP 

+ 1: PQPASRRPDARHALQRP 

DNA : CGGGGACACCAGCAGCTCTGCTGCGGAGCGGGAGGAGGACCGAAAGGACGG 306 

+ 3: GTPAALLRSGRRTERTE 

+ 2: GDTSSSAAEREEDRKDG 

+ 1: RGHQQLCCGAGGGPKGR 

DNA : AGAGGAGCAGGAGGAGCCGCGTGGCAAGGAGGAGCGCCAAGAGCCCAGCAC 357 

+ 3: RSRRSRVARRSAKSPAP 

+2: EEQEEPRGKEERQEPST 

+ 1: RGAGGAAWQGGAPRAQH 

DNA : CACGGCACGGAAGGTGGGGCGGCCTGGGAGGAAGCGCAAGCACCCCCCGGT 408 

+3: RHGRWGGLGGSASTPRW 

+2: TARKVGRPGRKRKHPPV 

+ 1: HGTEGGAAWEEAQAPPG 

DNA : GGAAAGCGGTGACACGCCAAAGGACCCTGCGGTGATCTCCAAGTCCCCATC 459 
+ 3: KAVTRQRTLR*SPSPHP 

+ 2: ESGDTPKDPAVISKSPS 

+ 1: GKR*HAKGPCGDLQVPI 

DNA : CATGGCCCAGGACTCAGGCGCCTCAGAGCTATTACCCAATGGGGACTTGGA 510 
+ 3: WPRTQAPQSYYPMGTWR 

+ 2: MAQDSGASELLPNGDLE 

+ 1: HGPGLRRLRAITQWGLG 



DNA : GAAGCGGAGTGAGCCCCAGCCAGAGGAG 561 
+ 3: S G V S P S Q R R 

+ 2: K R S E P Q P E E   DELETION 

+ 1: E A E * A P A R G E 
DNA : AGGGTGCAGCTGAGACCCTGCCTGAAGC 612 

+ 3: G C S * D P A * S 

+2: R V Q L R P C L K P 

+ 1: G AAETLPEA 

DNA : CTCAAGAGCAGTGGAAAATGGCTGCTGCACCCCCAAGGAGGGCCGAGGAGC 663 
+ 3: LKSSGKWLLHPQGGPRS 
+2: QEQWKMAAAPPRRAEEP 
+ 1: SRAVENGCCTPKEGRGA 

DNA: CCCTGCAGAAGCGGGCAAAGAACAGAAGGAGACCAACATCGAATCCATGAA 714 
+ 3: PCRSGQRTEGDQHRIHE 
+2: LQKRAKNRRRPTSNP*K 

+ 1: PAEAGKEQKETNIESMK 

DNA : AATGGAGGGCTCCCGGGGCCGGCTGCGGGGTGGCTTGGGCTGGGAGTCCAG 765 
+3: NGGLPGPAAGWLGLGVQ 
+2: WRAPGAGCGVAWAGSPA 

+1: MEGSRGRLRGGLGWESS 

DNA : CCTCCGTCAGCGGCCCATGCCGAGGCTCACCTTCCAGGCGGGGGACCCCTA 816 

+3: PPSAAHAEAHLPGGGPL 

+2: SVSGPCRGSPSRRGTPT 

+1: LRQRPMPRLTFQAGDPY 

DNA : CTACATCAGCAAGCGCAAGCGGGACGAGTGGCTGGCACGCTGGAAAAGGGA 867 
+3: LHQQAQAGRVAGTLEKG 
+2: TSASASGTSGWHAGKGR 
+1: YISKRKRDEWLARWKRE 

DNA: GGCTGAGAAGAAAGCCAAGGTCATTGCAGGAATGAATGCTGTGGAAGAAAA 918 
+3: G*EESQGHCRNECCGRK 
+2: LRRKPRSLQE*MLWKKT 

+1: AEKKAKVIAGMNAVEEN 

DNA : CCAGGGGCCCGGGGAGTCTCAGAAGGTGGAGGAGGCCAGCCCTCCTGCTGT 969 

+3: PGARGVSEGGGGQPSCC 

+2: RGPGSLRRWRRPALLLC 

+ 1: QGPGESQKVEEASPPAV 

DNA : GCAGCAGCCCACTGACCCCGCATCCCCCACTGTGGCTACCACGCCTGAGCC 1020 

+3: AAAH*PRIPHCGYHA*A 
+2: SSPLTPHPPLWLPRLSP 
+1: QQPTDPASPTVATTPEP 

DNA : CGTGGGGTCCGATGCTGGGGACAAGAATGCCACCAAAGCAGGCGATGACGA 1071 
+3: RGVRCWGQECHQSRR*R 
+2: WGPMLGTRMPPKQAMTS 
+ 1: VGSDAGDKNATKAGDDE 

DNA : GCCAGAGTACGAGGACGGCCGGGGCTTTGGCATTGGGGAGCTGGTGTGGGG 1122 

+3: ARVRGRPGLWHWGAGVG 

+2: QSTRTAGALALGSWCGG 

+ 1: PEYEDGRGFGIGELVWG 

DNA : GAAACTGCGGGGCTTCTCCTGGTGGCCAGGCCGCATTGTGTCTTGGTGGAT 1173 
+3: ETAGLLLVARPHCVLVD 
+2: NCGASPGGQAALCLGG* 
+1: KLRGFSWWPGRIVSWWM 
DNA : GACGGGCCGGAGCCGAGCAGCTGAAGGCACCCGCTGGGTCATGTGGTTCGG 1224 
+ 3: DGPEPSS*RHPLGHVVR 
+2: RAGAEQLKAPAGSCGSE 
+ 1: TGRSRAAEGTRWVMWFG 

DNA : AGACGGCAAATTCTCAGTGGTGTGTGTTGAGAAGCTGATGCCGCTGAGCTC 1275 

+3: RRQILSGVC*EADAAEL 

+2: TANSQWCVLRS*CR*AR 

+ 1: DGKFSVVCVEKLMPLSS 

DNA : GTTTTGCAGTGCGTTCCACCAGGCCACGTACAACAAGCAGCCCATGTACCG 1326 

+3: VLQCVPPGHVQQAAHVP 

+2: FAVRSTRPRTTSSPCTA 

+ 1: FCSAFHQATYNKQPMYR 

DNA : CAAAGCCATCTACGAGGTCCTGCAGGTGGCCAGCAGCCGCGCGGGGAAGCT 1377 

+3: QSHLRGPAGGQQPRGEA 
+2: KPSTRSCRWPAAARGSC 

+1: KAIYEVLQVASSRAGKL 

DNA : GTTCCCGGTGTGCCACGACAGCGATGAGAGTGACACTGCCAAGGCCGTGGA 1428 

+3: VPGVPRQR*E*HCQGRG 

+2: SRCATTAMRVTLPRPWR 

+1: FPVCHDSDESDTAKAVE 

DNA : GGTGCAGAACAAGCCCATGATTGAATGGGCCCTGGGGGGCTTCCAGCCTTC 1479 
+3: GAEQAHD*MGPGGLPAF 
+2: CRTSP*LNGPWGASSLL 

+1: VQNKPMIEWALGGFQPS 

DNA: TGGCCCTAAGGGCCTGGAGCCACCAGAAGAAGAGAAGAATCCCTACAAAGA 1530 
+ 3: WP*GPGATRRREESLQR 
+2: ALRAWSHQKKRRIPTKK 

+ 1: GPKGLEPPEEEKNPYKE 

DNA : AGTGTACACGGACATGTGGGTGGAACCTGAGGCAGCTGCCTACGCACCACC 1581 

+ 3: SVHGHVGGT*GSCLRTT 

+2: CTRTCGWNLRQLPTHHL 

+ 1: VYTDMWVEPEAAAYAPP 

DNA : TCCACCAGCCAAAAAGCCCCGGAAGAGCACAGCGGAGAAGCCCAAGGTCAA 1632 

+ 3: STSQKAPEEHSGEAQGQ 

+ 2: HQPKSPGRAQRRSPRSR 

+ 1: PPAKKPRKSTAEKPKVK 

DNA : GGAGATTATTGATGAGCGCACAAGAGAGCGGCTGGTGTACGAGGTGCGGCA 1683 

+ 3: GDY**AHKRAAGVRGAA 

+ 2: RLLMSAQESGWCTRCGR 

+ 1: EIIDERTRERLVYEVRQ 

DNA : GAAGTGCCGGAACATTGAGGACATCTGCATCTCCTGTGGGAGCCTCAATGT 1734 

+ 3: EVPEH*GHLHLLWEPQC 

+2: SAGTLRTSASPVGASML 

+ 1: KCRNIEDICISCGSLNV 

DNA : TACCCTGGAACACCCCCTCTTCGTTGGAGGAATGTGCCAAAACTGCAAGAA 1785 

+ 3: YPGTPPLRWRNVPKLQE 

+2: PWNTPSSLEECAKTART 

+ 1: TLEHPLFVGGMCQNCKN 
DNA : CTGCTTTCTGGAGTGTGCGTACCAGTACGACGACGACGGCTACCAGTCCTA 1836 

+ 3: LLSGVCVPVRRRRLPVL 
+2: AFWSVRTSTTTTATSPT 

+ 1: CFLECAYQYDDDGYQSY 

DNA : CTGCACCATCTGCTGTGGGGGCCGTGAGGTGCTCATGTGCGGAAACAACAA 1887 

+ 3: LHHLLWGP*GAHVRKQQ 
+2: APSAVGAVRCSCAETTT 

+ 1: CTICCGGREVLMCGNNN 

DNA : CTGCTGCAGGTGCTTTTGCGTGGAGTGTGTGGACCTCTTGGTGGGGCCGGG 1938 
+ 3: LLQVLLRGVCGPLGGAG 
+2: AAGAFAWSVWTSWWGRG 
+ 1: CCRCFCVECVDLLVGPG 

DNA : GGCTGCCCAGGCAGCCATTAAGGAAGACCCCTGGAACTGCTACATGTGCGG 1989 
+ 3: GCPGSH*GRPLELLHVR 
+2: LPRQPLRKTPGTATCAG 
+ 1: AAQAAIKEDPWNCYMCG 

DNA : GCACAAGGGTACCTACGGGCTGCTGCGGCGGCGAGAGGACTGGCCCTCCCG 2040 
+3: AQGYLRAAAAARGLALP 
+2: TRVPTGCCGGERTGPPG 
+ 1: HKGTYGLLRRREDWPSR 

DNA : GCTCCAGATGTTCTTCGCTAATAACCACGACCAGGAATTTGACCCTCCAAA 2091 

+3: APDVLR**PRPGI*PSK 

+2: SRCSSLITTTRNLTLQR 

+ 1: LQMFFANNHDQEFDPPK 

DNA : GGTTTACCCACCTGTCCCAGCTGAGAAGAGGAAGCCCATCCGGGTGCTGTC 2142 
+3: GLPTCPS*EEEAHPGAV 
+2: FTHLSQLRRGSPSGCCL 
+1: VYPPVPAEKRKPIRVLS 

DNA : TCTCTTTGATGGAATCGCTACAGGGCTCCTGGTGCTGAAGGACTTGGGCAT 2193 

+3: SL*WNRYRAPGAEGLGH 

+2: SLMESLQGSWC*RTWAF 

+1: LFDGIATGLLVLKDLGI 

DNA : TCAGGTGGACCGCTACATTGCCTCGGAGGTGTGTGAGGACTCCATCACGGT 2244 
+3: SGGPLHCLGGV*GLHHG 
+2: RWTATLPRRCVRTPSRW 
+ 1: QVDRYIASEVCEDSITV 

DNA : GGGCATGGTGCGGCACCAGGGGAAGATCATGTACGTCGGGGACGTCCGCAG 2295 

+3: GHGAAPGEDHVRRGRPQ 

+2: AWCGTRGRSCTSGTSAA 

+ 1: GMVRHQGKIMYVGDVRS 

DNA : CGTCACACAGAAGCATATCCAGGAGTGGGGCCCATTCGATCTGGTGATTGG 2346 
+3: RHTEAYPGVGPIRSGDW 
+2: SHRSISRSGAHSIW*LG 
+ 1: VTQKHIQEWGPFDLVIG 

DNA : GGGCAGTCCCTGCAATGACCTCTCCATCGTCAACCCTGCTCGCAAGGGCCT 2397 

+3: GQSLQ*PLHRQPCSQGP 
+2: AVPAMTSPSSTLLARAS 

+ 1: GSPCNDLSIVNPARKGL 






Li et al. 
Appl. No. 09/720,086 



DNA : CTACGAGGGCACTGGCCGGCTCTTCTTTGAGTTCTACCGCCTCCTGCATGA 2448 

+ 3: LRGHWPALL*VLPPPA* 
+2: TRALAGSSLSSTASCMM 

+ 1: YEGTGRLFFEFYRLLHD 

DNA : TGCGCGGCCCAAGGAGGGAGATGATCGCCCCTTCTTCTGGCTCTTTGAGAA 2499 

+ 3: CAAQGGR*SPLLLAL*E 
+ 2: RGPRREMIAPSSGSLRM 

+ 1: ARPKEGDDRPFFWLFEN 

DNA : TGTGGTGGCCATGGGCGTTAGTGACAAGAGGGACATCTCGCGATTTCTCGA 2550 

+ 3: CGGHGR**QEGHLAISR 
+2: WWPWALVTRGTSRDFSS 

+ 1: VVAMGVSDKRDISRFLE 

DNA : GTCCAACCCTGTGATGATTGATGCCAAAGAAGTGTCAGCTGCACACAGGGC 2601 

+ 3: VQPCDD*CQRSVSCTQG 
+ 2: PTL**LMPKKCQLHTGP 

+ 1: SNPVMIDAKEVSAAHRA 

DNA : CCGCTACTTCTGGGGTAACCTTCCCGGTATGAACAGGCCGTTGGCATCCAC 2652 

+ 3: PLLLG*PSRYEQAVGIH 
+2: ATSGVTFPV*TGRWHPL 

+ 1: RYFWGNLPGMNRPLAST 

DNA : TGTGAATGATAAGCTGGAGCTGCAGGAGTGTCTGGAGCATGGCAGGATAGC 2703 

+3: CE**AGAAGVSGAWQDS 
+ 2: *MISWSCRSVWSMAG*P 

+ 1: VNDKLELQECLEHGRIA 

DNA : CAAGTTCAGCAAAGTGAGGACCATTACTACGAGGTCAAACTCCATAAAGCA 2754 

+ 3: QVQQSEDHYYEVKLHKA 
+2: SSAK*GPLLRGQTP*SR 

+ 1: KFSKVRTITTRSNSIKQ 

DNA : GGGCAAAGACCAGCATTTTCCTGTCTTCATGAATGAGAAAGAGGACATCTT 2805 

+ 3: GQRPAFSCLHE*ERGHL 
+2: AKTSIFLSS*MRKRTSY 

+ 1: GKDQHFPVFMNEKEDIL 

DNA : ATGGTGCACTGAAATGGAAAGGGTATTTGGTTTCCCAGTCCACTATACTGA 2856 

+ 3: MVH*NGKGIWFPSPLY* 
+2: GALKWKGYLVSQSTILT 

+ 1: WCTEMERVFGFPVHYTD 

DNA : CGTCTCCAACATGAGCCGCTTGGCGAGGCAGAGACTGCTGGGCCGGTCATG 2907 

+ 3: RLQHEPLGEAETAGPVM 
+ 2: SPT*AAWRGRDCWAGHG 

+ 1: VSNMSRLARQRLLGRSW 

DNA : GAGCGTGCCAGTCATCCGCCACCTCTTCGCTCCGCTGAAGGAGTATTTTGC 2958 

+ 3: ERASHPPPLRSAEGVFC 
+ 2: ACQSSATSSLR*RSILR 

+ 1: SVPVIRHLFAPLKEYFA 

DNA : GTGTGTGTAAGGGACATGGGGGCAAACTGAGGTAGCG 2995 

+3: VCVRDMGAN*GS 
+2: VCKGHGGKLR* 

+ 1: CV*GTWGQTEVA 
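
The three forward reading frames listed in Figure 3 can be reproduced computationally. The Python sketch below is illustrative only and is not part of the declaration. It assumes the standard genetic code, and the names CODON_TABLE, translate, and translate_frames are hypothetical. It translates complete codons only, so a frame that starts one or two bases into a row comes out one residue shorter than the figure, which borrows bases from the following row.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, offset=0):
    """Translate complete codons of `dna`, starting `offset` bases in."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def translate_frames(dna):
    """Return translations for offsets 0, 1, and 2, keyed by offset."""
    return {offset: translate(dna, offset) for offset in range(3)}


# First 51 nucleotides of the deposited clone as shown in Figure 3.
first_row = "GCCGCGGCACCAGGGCGCGCAGCCGGGCCGGCCCGACCCCACCGGCCATAC"
for offset, protein in translate_frames(first_row).items():
    print(offset, protein)
```

Offset 0 corresponds to the row labeled +1 at the top of Figure 3. An asterisk marks a stop codon, as in the figure.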
Estimation of Errors in "Raw" DNA 
Sequences: A Validation Study 

Peter Richterich 1 
As DNA sequencing is performed more and more in a mass-production-like manner, efficient quality control 
measures become increasingly important for process control, but so also does the ability to compare different 
methods and projects. One of the fundamental quality measures in sequencing projects is the position-specific 
error probability at all bases in each individual sequence. Accurate prediction of base-specific error rates from 
"raw" sequence data would allow immediate quality control as well as benchmarking different methods and 
projects while avoiding the inefficiencies and time delays associated with resequencing and assessments after 
"finishing" a sequence. The program PHRED provides base-specific quality scores that are logarithmically 
related to error probabilities. This study assessed the accuracy of PHRED's error-rate prediction by analyzing 
sequencing projects from six different large-scale sequencing laboratories. All projects used four-color 
fluorescent sequencing, but the sequencing methods used varied widely between the different projects. The 
results indicate that the error-rate predictions such as those given by PHRED can be highly accurate for a large 
variety of different sequencing methods as well as over a wide range of sequence quality. 



In DNA sequencing, knowledge about the accuracy 
of sequences can be very valuable. For example, dif- 
ferent large-scale sequencing projects may produce 
sequences at similar rates and costs but with signifi- 
cantly different error rates in the final sequence. 
One major determinant in the final error rate is the 
accuracy of the "raw" sequence. Knowledge about 
the frequency and location of errors in the raw 
sequence data can help to direct "polishing" efforts to 
the places where additional effort is needed; it also 
enables the comparison between different sequencing 
projects without requiring that the same region 
be sequenced in each project. 

Another area where estimates about sequence 
error rates would be beneficial is technology 
development. Accurate error estimates at each base would 
enable "quality benchmarking" between different 
methods, thus enabling researchers to choose the 
method that fills their needs for accuracy and 
throughput best. 

Several groups have developed mathematical 
models to predict the error probability at any given 
position in raw sequences. Lawrence and Solovyev 
used linear discriminant analysis to calculate separate 
probability estimates for insertions, deletions, 
and mismatches (Lawrence and Solovyev 1994). Ewing 
and Green (1998) developed the program 
PHRED, which calculates a quality score at each 
base. This quality score q is logarithmically linked to 
the error probability p: q = -10 x log10(p) (for a 
discussion of how quality scores are calculated and 
what the limitations are, see Ewing et al. 1998). 
When used in combination with sequence assembly 
and finishing programs that utilize these error 
estimates, reliable error probabilities promise to 
increase the accuracy of consensus sequences and to 
reduce the efforts required in the finishing phase of 
sequencing projects (Churchill and Waterman 
1992; Bonfield and Staden 1995). 
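
The relation q = -10 x log10(p) can be applied in both directions. The following Python sketch is a minimal illustration of that formula only; the function names are hypothetical and it is not taken from the PHRED program.

```python
import math


def quality_to_error_prob(q):
    """Error probability p implied by quality score q: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_quality(p):
    """Quality score q implied by error probability p: q = -10 * log10(p)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


for q in (4, 10, 20, 30, 51):
    print(f"q = {q:2d}  ->  p = {quality_to_error_prob(q):.6f}")
```

Each increase of 10 in the quality score corresponds to a tenfold lower error probability.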

To examine the accuracy of probability estimates 
made by the program PHRED, we compared 
the actual and predicted error rates for six different 
cosmid- or BAC-sized projects that were produced 
by six different large-scale sequencing centers in the 
United States. All of these six projects used four-color 
fluorescent sequencing machines; however, 
the DNA preparation methods, sequencing enzymes, 
fluorescent dyes and chemistries, and gel 
lengths varied significantly between the six groups. 
Table 1 gives an overview of the sequencing projects 
analyzed. Table 2 lists the different methods used. 

RESULTS 

Error Rate Prediction Accuracy for Six Projects 

A comparison of actual and predicted error rates for 
the six projects in this study is shown in Table 3. 
Table 1. Summary of Data Sets 

                                      Average aligned 
Project    Reads    Aligned bases     read length 
A            455          416,214     915 
B           1277          871,230     682 
C           1065          603,655     567 
D            834          414,595     497 
E           1638        1,149,209     702 
F           1885          907,796     482 
Total       7154        4,362,699     610 



The results indicate that PHRED is very successful in 
identifying bases with low error probabilities. For 
example, the 1.28 million bases with quality scores 
of 4-12 (corresponding to error probabilities between 
39.8% and 6.3%) contain a total of 187,926 
errors. In contrast, the 1.44 million bases with quality 
scores of 33-42 (corresponding to error 
probabilities between 0.05% and 0.006%) contain 
only 237 errors, which translates into a 790-fold 
lower error rate. The trend toward lower error rates 
can also be observed for each individual project. In 
most cases, the actual number of errors is close to 
the predicted number. It is also apparent that the 
actual error rate is typically lower than the predicted 
error rate. 
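
The comparison of predicted and actual errors can be illustrated with a short Python sketch. It assumes that the expected number of errors in a set of bases is the sum of the per-base error probabilities implied by their quality scores; that is an assumption made here for illustration, and the function names are hypothetical. The example figures for project A are the 4-12 column of Table 3.

```python
def quality_to_error_prob(q):
    """Error probability implied by a quality score: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def expected_errors(quality_scores):
    """Expected error count, assumed to be the sum of per-base probabilities."""
    return sum(quality_to_error_prob(q) for q in quality_scores)


def actual_to_predicted_ratio(actual_errors, predicted_errors):
    """Ratio below 1 means errors were overpredicted."""
    return actual_errors / predicted_errors


# 1,000 bases that all have quality score 20 (p = 1%) should hold about 10 errors.
print(expected_errors([20] * 1000))

# Project A, quality scores 4-12 (Table 3): 20,256 expected, 16,784 actual.
print(round(actual_to_predicted_ratio(16_784, 20_256), 3))
```

A ratio below 1, as for project A here, reflects the tendency for the actual error rate to fall below the predicted one.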

Both the high overall accuracy and the tendency 
to slightly overpredict errors are confirmed 
by statistical analysis, as shown in Table 4. The correlation 
between predicted and actual error frequencies 
is excellent for all projects (Spearman correlation 
coefficient >0.89, P < 0.0001). Averaged over all 
projects, the actual error rate is 84.5% of the predicted 
error rate; the slope of the relation between 
predicted and actual error rates differs slightly between 
projects and ranges from 76.6% to 88.4%. To 
put these differences between projects in perspective, it 
is worthwhile remembering that PHRED quality 
scores cover a wide dynamic range; the maximum 
quality score of 51 corresponds to a 50,000-fold 
lower predicted error rate than the minimum quality 
score of 4. Even the relative difference between 
successive quality scores is larger than the relative difference 
in the slopes; for example, a quality score of 10 
corresponds to an error probability of 10%, whereas 
a score of 9 corresponds to an error probability of 
12.6%. 
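
The arithmetic behind this comparison can be checked with a few lines of Python. This is a sketch based only on the relation q = -10 x log10(p) and the slope range quoted above; the function name is hypothetical.

```python
def fold_difference(q_high, q_low):
    """How many times lower the predicted error rate is at q_high than at q_low."""
    return 10 ** ((q_high - q_low) / 10)


# Dynamic range between the minimum (4) and maximum (51) quality scores.
print(round(fold_difference(51, 4)))      # roughly 50,000-fold

# Relative difference between successive quality scores, e.g. 9 and 10.
step = fold_difference(10, 9)
print(round(step, 3))                     # about 1.259, i.e. 12.6% versus 10%

# Relative difference between the largest and smallest slopes across projects.
slope_ratio = 88.4 / 76.6
print(round(slope_ratio, 3))              # about 1.154

print(step > slope_ratio)
```

Because one quality-score step changes the predicted error rate by about 26%, while the slopes differ by about 15%, the differences between projects amount to less than one quality-score unit.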

A different way of looking at the relation be- 
tween the actual and predicted error rates is shown 



in Figure 1. Here, the error rates as a function of the 
position within all reads in each of the projects, averaged 
over 50-base windows, are depicted. For all six 
projects, the predicted error rates are very close to 
the actual error rates over the entire length of the 
sequences. Each project has a characteristic distribution 
of error rates, which differs from each of the 
other projects. The minimum error rate differs dramatically 
between projects. The best projects 
achieve raw error rates of 0.23%-0.36% in the best 
region of the sequence read, typically from base 150 
to 200. The worst project in the data set had an 
~10-fold higher error rate of 2.58%. 
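
Averaging error rates over 50-base windows by read position can be sketched as follows. This Python example is an assumption about how such a profile might be computed, not the procedure used in the study; each read is represented as a hypothetical list of 0/1 flags marking whether the base at that position was an error.

```python
def windowed_error_rates(reads, window=50):
    """Pool all reads and return the error rate for each position window.

    `reads` is a list of reads; each read is a list of 0/1 error flags
    indexed by position within the read.
    """
    max_len = max((len(read) for read in reads), default=0)
    n_windows = (max_len + window - 1) // window
    errors = [0] * n_windows
    bases = [0] * n_windows
    for read in reads:
        for pos, flag in enumerate(read):
            w = pos // window
            errors[w] += flag
            bases[w] += 1
    # Every window up to the longest read contains at least one base.
    return [e / b for e, b in zip(errors, bases)]


# Two toy reads: errors appear only in the second window of the first read.
toy_reads = [[0] * 50 + [1] * 50, [0] * 75]
print(windowed_error_rates(toy_reads))
```

Shorter reads simply contribute no bases to later windows, so the rate at each position reflects only the reads that extend that far.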

Toward the end of sequence reads, the error 
rates increase and start to exceed 10% between bases 
300 and 700. In projects that used mainly short gels 
(e.g., projects D and F), this increase begins sooner, 
whereas projects that use longer gels show a markedly 
longer stretch of low error rates (e.g., projects A 
and B). 

Table 5 summarizes key results for the six 
projects. The first four projects have similar mini- 
mum and average error rates. However, the length 
of the region where the error rate is below 5% differs 
significantly, from 403 to 682 bases. The project 
with the shorter low error rate regions contained 
larger portions of reads generated on short gels, 
whereas projects A and B were run exclusively on 
long gels (ABI373 stretch or ABI377 sequencers). 
Other factors contributing to differences between 
the first four projects were differences in sequencing 
chemistries, production scale, and electrophoresis 
conditions and machines. 

Project E and, in particular, project F, had significantly higher error
rates than the first four projects. In projects E and F, every sequence
generated for the project had been included in the data set, whereas the
other four projects had eliminated some "bad" sequences through manual or
automatic inspection. After eliminating <10% of the worst sequences in
project E, the error rates for the remaining sequences were comparable to
those of the first four projects. In contrast, project F showed a much
more uniform distribution of sequence quality.

Table 2. Overview of Sequencing Methods Used in the Different Projects

Template DNA            Single-stranded M13, double-stranded plasmids
Sequencing enzymes      Sequenase, Taq, KlenTaqTR, AmpliTaq FS
Sequencing chemistries  Dye primer (two different dye chemistries), dye terminator
Sequencing machines     ABI 373, ABI 373 stretch, ABI 377
Gel length              Only short gels, only long gels, mixes of short and long gels

Table 3. Comparison of Predicted and Actual Error Rates for Six Different
Sequencing Projects

                                                Quality score
Project                         4-12       13-22       23-32       33-42       43-51

A       aligned bases        119,246      75,293      70,391     144,876      73,234
        expected errors       20,256       2,064         172          37           1
        actual errors         16,784       1,758         127          17           1

B       aligned bases        182,034     137,940     181,998     399,690     140,176
        expected errors       29,953       3,704         410         102           3
        actual errors         26,038       2,536         287          35           0

C       aligned bases        139,345     131,419     151,197     292,070      68,529
        expected errors       22,277       3,411         357          74           2
        actual errors         16,670       1,513         194          26           3

D       aligned bases        103,898      68,995      68,613     153,730     111,752
        expected errors       16,880       1,919         168          38           3
        actual errors         14,495       1,924         146          59           2

E       aligned bases        378,755     217,438     167,968     392,717     144,313
        expected errors       63,947       6,336         418          95           4
        actual errors         55,968       6,516         355          67           5

F       aligned bases        359,809     136,688      98,840      64,035       5,130
        expected errors       66,938       4,079         256          23           0
        actual errors         57,971       3,856         332          33           1

All     aligned bases      1,283,087     767,773     739,007   1,447,118     543,134
        expected errors      220,252      21,513       1,781         370          13
        actual errors        187,926      18,103       1,441         237          12



Table 4. Summary of Statistical Analysis Results

Project   Spearman ρ   P > |ρ|    Slope   t ratio   P > |t|

A           0.9646     <0.0001    0.818     75.1    <0.0001
B           0.9890     <0.0001    0.874     98.2    <0.0001
C           0.9846     <0.0001    0.766     71.6    <0.0001
D*          0.8692     <0.0001    0.855     68.3    <0.0001
E           0.9956     <0.0001    0.884    144.3    <0.0001
F           0.9968     <0.0001    0.865    151.6    <0.0001
All         0.9964     <0.0001    0.845    174.5    <0.0001

*In project D, the Spearman correlation coefficient ρ was artificially low,
as only very few bases (10) had a quality score of 5, and none of these
bases contained an actual error (expected: 3.16 errors). Exclusion of this
quality score gave a Spearman correlation coefficient of 0.9786
(P < 0.0001). The frequencies in the slope calculations were weighted by
the number of bases at any given quality score and, thus, were not
sensitive to such small sample distortions (see Methods).

Figure 1 Actual and predicted error rates in six different sequencing
projects. Actual error rates and predicted error rates in 50-base windows
over the length of the sequence reads, averaged over all reads that could
be aligned to the consensus sequence by CROSS_MATCH, are shown. The numbers
on the x-axis show the first base in a given 50-base window.

The last column in Table 5 shows the average number of bases with an
estimated error probability of at most 0.1%, which is equivalent to a
quality score of at least 30. The count of such "very high-quality" bases
is a good indicator of sequence quality, both for individual sequences
and, when averaged over all sequences in a project, as an indicator for
the entire project. Compared to the estimated error rates, the count of
very high-quality bases is less prone to distortions from a small number
of low-quality reads, as the data for project E demonstrate.
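
The robustness argument can be illustrated with a small numerical sketch in
Python. All numbers below are invented for illustration and are not taken
from any of the six projects: a hypothetical project with 90 good reads and
10 failed reads is summarized once by its mean error rate and once by its
mean count of very high-quality (VHQ) bases.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical project: each read is summarized as
# (average error rate of the read, number of VHQ bases in the read).
good_reads = [(0.01, 400)] * 90     # 1% error, 400 VHQ bases each
failed_reads = [(0.40, 0)] * 10     # 40% error, no VHQ bases


def summarize(reads):
    """Return (mean error rate, mean VHQ count) over a list of reads."""
    return mean([r[0] for r in reads]), mean([r[1] for r in reads])


err_good, vhq_good = summarize(good_reads)
err_all, vhq_all = summarize(good_reads + failed_reads)

# Relative change caused by adding the 10% failed reads.
error_inflation = err_all / err_good   # mean error rate rises several-fold
vhq_change = vhq_all / vhq_good        # mean VHQ count drops by only 10%
```

In this toy case the failed reads raise the mean error rate almost
fivefold, whereas the mean VHQ count changes only in proportion to the
fraction of failed reads.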



Table 5. Comparison of Key Results for Six Different Sequencing Projects

           Actual minimum   Actual average   Length of <1%   Length of <5%   Average bases with
Project    error rate (%)   error rate (%)   error region    error region    P(error) <0.1%

A               0.36             3.6              422             682               468
B               0.34             2.8              274             567               395
C               0.23             2.4              291             479               348
D               0.39             3.1              300             403               294
E               0.71             4.7              129             464               317
F               2.58             9.2                0             162                79

Prediction Accuracy for Data Subsets of Different
Quality

The quality of sequences within any given project can vary substantially,
and the use of predicted error rates has the potential to be a powerful
tool for quality analysis and control in large-scale DNA sequencing
projects. To analyze how accurate PHRED error estimates are for different
quality sequences within the same sequencing project, we subdivided a data
set into four quartiles, based on the number of very high-quality bases in
each sequence (see Methods). The comparison of actual and predicted error
rates is shown in Figure 2.

When measured by the error rate in the best region of a sequence, the data
quality in the different quartiles varies >100-fold between the best and
the worst 25% of the sequences. The best quartile showed <0.03% error for
>100 bases, whereas the error rate in the worst quartile always exceeded
5%. In quartiles 2 and 3, the predicted error rates match the actual error
rates very closely. In the best and worst quartiles, PHRED's accuracy was
somewhat lower from base 100 to 500. In the best sequences, PHRED's error
estimates were about twofold too high; in the worst sequences, the error
estimates were too low, again by a factor of 2. This underprediction of
errors can be partially explained by the fact that PHRED gives ambiguous
base calls (N's) a quality score of 4, corresponding to an error
probability of 39.8%; however, N's will always show up as an actual error.
Even in the worst and best quartiles, however, the predicted error rate
curves are very similar to the actual error rate curves.

Figure 2 Actual and predicted error rates in different quality subsets of
project B. Sequence reads were sorted by the number of bases with a
predicted error rate of at most 0.1% (very high-quality bases), and
assigned to quartiles, with quartile 1 corresponding to the highest
numbers. Actual and predicted error rates for all sequences in each subset
were calculated as in Fig. 1. Note that a number of sequence reads that
had been rejected because of too low quality were added back to the data
set for illustrative purposes, all of which are in quartile 4. These
sequences were not included in the data sets used to generate Figs. 1 and
3 and Tables 1 and 3.

The results shown in Figure 2 also demonstrate that the count of very
high-quality bases, or bases with an estimated error probability of at
most 0.1%, can be used effectively to characterize the overall quality of
a sequence read. Sorting the sequence reads into quartiles based on the
number of very high-quality bases worked well, as shown by the >100-fold
difference in the minimum error rate between the first and the fourth
quartile.

Other methods to characterize the overall quality of individual reads
based on PHRED quality scores can give similar results. For example,
counting bases above a minimum quality threshold anywhere in the range of
20-40 gave similar results for most data sets (not shown), and such counts
are used by a number of different laboratories as quality measures.
Alternatively, the quality values can be converted to error probabilities
and averaged to give the predicted error rate for the trace, or summed to
give the total predicted number of errors in a trace. However, such
averages and totals can sometimes give a misleading picture, as the
following example illustrates. Assume that two sequence reads have very
similar quality in the alignable part of the read but that one of the two
sequences was run much longer and therefore contains a longer unalignable
"tail" of very low-quality bases. When calculating the average error rate
for these two sequences, the second sequence will have a much higher
average error and, therefore, appear to be of lower quality. In contrast,
the counts of very high-quality bases for both sequences will be very
similar, as the unalignable tails contain few, if any, high-quality bases.
Therefore, counts of bases above a high enough quality threshold will give
a more robust and clearer picture of trace quality.

Figure 3 Actual frameshift and total error rates for projects A and B. To
calculate frameshift error rates, only insertions and deletions were
counted. Mismatch errors, which account for the vast majority of errors
after base 150, were included only in the total error count. Note that
project B (▲, △) has a similar or slightly higher total error rate
compared to project A (●, ○) but only about one-third as many insertions
and deletions up to base 500. For both projects, the frameshift error rate
in the raw data is <1 in 1000 for >300 bases, and <1 in 10,000 for >100
bases in project B.

Frameshift Error Rates for Different Sequencing
Chemistries

Depending on how biologists use DNA sequences, knowledge about total error
rates in raw sequences may or may not be sufficient. For example,
frameshift errors in coding sequences will generally lead to incorrectly
predicted open reading frames, whereas mismatch errors will do so only if
the mismatch introduces a stop codon or a new splice site. At the time of
this writing, PHRED did not differentiate between mismatch and frameshift
errors, but only estimated total error rates. This might occasionally lead
to questionable conclusions, as the results shown in Figure 3 illustrate.

Figure 3 shows the total actual error rates and the frameshift error rates
for two projects, A and B. The total error rates for both projects are
similar for up to 350 bases; after 350 bases, project B has a somewhat
higher total error rate. However, examining the frameshift error rate
gives rise to a different picture: from base 1 to 500, project A has
approximately four times as many insertions and deletions as project B.
This difference in frameshift error rates can be explained by the
sequencing chemistries that were used in the two projects. Project B, with
the lower frameshift error rate, used only dye terminator chemistry, which
is known to eliminate band spacing artifacts from hairpin structures
("compressions"). Project A, on the other hand, used dye primer chemistry,
which is more prone to insertion and deletion errors from mobility
artifacts, for most sequencing reactions.
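
The distinction between frameshift errors and total errors can be sketched
as follows. This Python example assumes a gapped pairwise alignment given
as two equal-length strings with '-' marking gaps; this input format and
the function name are illustrative assumptions and do not reproduce the
output format of CROSS_MATCH.

```python
def classify_errors(read_aln, consensus_aln):
    """Count mismatches, insertions and deletions in a gapped alignment.

    Both strings must have equal length; '-' marks a gap.
    """
    if len(read_aln) != len(consensus_aln):
        raise ValueError("aligned strings must have the same length")
    counts = {"mismatch": 0, "insertion": 0, "deletion": 0}
    for r, c in zip(read_aln, consensus_aln):
        if r == c:
            continue
        if c == "-":
            counts["insertion"] += 1   # extra base in the read
        elif r == "-":
            counts["deletion"] += 1    # base missing from the read
        else:
            counts["mismatch"] += 1
    return counts


# Toy alignment with one insertion, one deletion and one mismatch.
counts = classify_errors("ACGTTAGC-ATG", "ACG-TAGCCTTG")
frameshift_errors = counts["insertion"] + counts["deletion"]
total_errors = frameshift_errors + counts["mismatch"]
```

Dividing `frameshift_errors` and `total_errors` by the number of aligned
bases would give the two kinds of rates compared in Figure 3.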



DISCUSSION 

As large-scale DNA sequencing has become a more routine and common
process, the traditional methods for assessing sequence quality have
become unsatisfactory. In projects like single-pass cDNA sequencing, it is
not possible to calculate and compare error rates after finishing a
sequence, as finishing never takes place. Even when a comparison between
raw and finished sequence can be done, the time delay between raw data
generation and quality assessment is often large. This delay makes it
difficult to improve ongoing projects, and it sometimes makes it
impossible to capture problems early on. More immediate quality feedback
can be reached by including known standard sequences for quality control.
However, this approach can be costly, and it fails when error profiles
differ between standard and unknown sequences.

In contrast to these traditional methods to assess sequence accuracy,
direct estimation of error rates in raw sequence data would enable
immediate quality control and feedback. Accurate, base-by-base estimates
of error probabilities could also increase the utility of single-pass
sequences significantly, allow efficient comparison and optimization of
different sequencing methods, and enable the development of better
software tools for sequence assembly and analysis.

The critical question for any error rate prediction tool is how accurate
the error rate estimates are, in particular if different sequencing
methods and chemistries are used. The results presented herein provide an
answer to this question for the program PHRED, as well as clues where
further development would be useful. As shown in Tables 3 and 4 and in
Figure 1, the agreement between predicted and actual error rates was very
good in each of the six different projects analyzed. The observed high
level of prediction accuracy in all of these projects is almost
astonishing if one takes into account that actual errors are binary (a
base is either correct or wrong), whereas predicted error rates are
probabilities on a scale from 0.0 to 1.0. The observed tendency to
overpredict error rates can be at least partially explained [...] that
[...] parameters for quality scores (Ewing and Green 1998). For most
practical applications, such a somewhat conservative estimation of quality
scores is tolerable or even desirable. Overall, the results clearly show
that error probabilities given by PHRED accurately describe raw sequence
data quality.

In judging the usefulness of predicted error probabilities, it is
important to know how differences in sequencing methods will influence the
prediction accuracy. For example, the variation in peak heights tends to
be larger in dye terminator sequencing than in dye primer sequencing, and
different sequencing enzymes are known to produce different specific
height variation patterns. Any estimation of error probabilities that
takes the peculiarities of a specific sequencing chemistry into account
would therefore be expected to be less accurate for different chemistries.

The projects included in this study were specifically chosen to provide an
initial answer to the question of how generally useful PHRED quality
scores are. These projects represent the vast majority of different
multicolor fluorescent sequencing methods used in the last 3 years:
different template DNAs and DNA preparation methods, different enzymes,
gel lengths, run conditions, and different fluorescent dyes. The data also
include a considerable spread in data quality, both between projects and
within individual projects. None of the projects analyzed here were
included in PHRED's training set, and just one of the six laboratories
that contributed data to this study also contributed data to the training
data sets. One of the projects in this study consisted entirely of dye
terminator sequences, which represented only a small fraction of the
sequences in the test data set. Another project exclusively used a set of
fluorescent dyes different from those used in the training sets. Each
project differed from the other projects in this study in at least one,
and typically many, experimental aspects like template preparation,
sequencing enzymes, gel run conditions, and so forth. Despite these
differences, the accuracy of error rate predictions was very similar for
all projects.

Our results justify some optimism about the accuracy of PHRED quality
scores for minor changes in sequencing technology, for example, sequences
generated by new enzymes and fluorescent dyes. Initial studies showed that
PHRED quality scores were also accurate for sequences produced by
multiplex sequencing with radioactive detection (P. Richterich, unpubl.).
However, we also observed two effects that can invalidate PHRED quality
scores during these studies. First, sequences generated by chemical
sequencing gave too low quality scores at mixed (A + G) reactions. Because
secondary peak height is one of the parameters used in the error rate
predictions, this is not surprising. Another potential source of error is
high-frequency noise in the trace data. With such data, PHRED occasionally
underestimated the band spacing by a factor of 2 or more, which resulted
in incorrect base calls and quality scores. By applying simple smoothing
algorithms to data with high-frequency noise, these problems could
typically be resolved. Similar steps may be necessary to obtain accurate
PHRED quality scores on data that have been generated by different
sequencing instruments or preprocessed by different software.

Accurate quality scores can have a major impact on how sequences are used
downstream from the sequence production process. In traditional sequencing
projects where the goal is complete coverage at a final error rate below
(e.g.) 1 in 10,000, the accuracy goals can be reached with single sequence
reads as long as the quality scores are at least 40 (however, other
potential problems like clone instability may make higher coverage
advisable). Interesting questions arise as to how individual read quality
contributes to project quality, or the error rate of the "final" sequence.
Under the assumption that errors between different sequence reads are
completely independent, one could argue that two reads with a quality
score of 20 (error probability of 1 in 100) are just as valuable as one
sequence with a quality score of 40 (error probability of 1 in 10,000).
However, although a single sequence stretch with quality levels above 40
would give a final sequence with an error rate of <1 in 10,000, assembling
a consensus from two sequences with quality scores of 20 (1% error rate)
could lead to one of two results: If the errors were completely random,
the consensus sequence would be ambiguous at 2% of all locations; if the
errors were completely localized, for example, because of reproducible
compressions, the consensus sequence would have one "hidden" error every
100 bases. Typically, consensus sequences derived from low-quality
sequences will have both kinds of problematic regions. Increased coverage
can rapidly eliminate the random errors; however, increased coverage does
not resolve errors from systematic sources. Manual examination of such
problem areas is generally required; such "contig editing," however, tends
to be time consuming, requires highly trained personnel, is an obstacle
toward complete automation of DNA sequencing, and sometimes fails to
eliminate all errors. This leads to the somewhat counterintuitive
conclusion that the practical value of increasing sequence quality can be
even higher than indicated by the quality scores: One sequence of average
quality above 40 can be "worth" more than two sequences of average
quality 20.
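
The arithmetic behind the two-read example can be made explicit. The
following Python sketch assumes exactly two reads covering each position,
each with a per-base error probability of 1% (quality score 20), and
treats the two limiting cases described above separately; the variable
names are illustrative.

```python
def error_probability(q):
    """Error probability implied by a PHRED-style quality score q."""
    return 10 ** (-q / 10)


p = error_probability(20)        # 1% per base for each of the two reads

# Case 1: errors are independent between the two reads.
p_disagree = 2 * p * (1 - p)     # exactly one read wrong: ambiguous position
p_both_wrong = p * p             # both reads wrong at the same position

# Case 2: errors are fully localized (both reads fail at the same
# positions in the same way), so the reads agree but are wrong.
p_hidden = p

# Reference: a single read with quality score 40.
p_single_q40 = error_probability(40)
```

Under independence, the reads disagree at about 2% of positions, which
must then be resolved; under full localization, the hidden error rate of
1% is 100-fold higher than that of a single read with quality score 40.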

Another application of DNA sequencing where high quality can be of
disproportionately high value is the search for mutations in genomic DNA.
In low quality sequences, secondary peaks and low resolution often
complicate the identification of heterozygous mutations. In regions of
higher sequence quality, such secondary peaks are smaller or absent and
peaks are better resolved; therefore, both false-positive and
false-negative errors can be significantly reduced in high-quality
regions. Tools like PHRED, which can accurately measure sequence quality
from trace data, can be of twofold value for mutation detection. First,
base-specific quality scores can allow optimization of sequencing methods
and strategies for mutation detection. Second, the quality scores can be
used to evaluate the usefulness of individual sequence reads for mutation
detection (e.g., by discarding reads below minimum thresholds), and they
can guide software that automatically detects mutations.

The ability to predict error rates in a highly accurate fashion is likely
to have a major impact in applications like those described above. PHRED
is the first widely used program that accurately predicts base-specific
error probabilities. However, the algorithm for determining quality values
has been described (Ewing and Green 1998), and it should be
straightforward to implement similar quality values in other base-calling
programs. Furthermore, an extension of the approach developed by Ewing and
Green should be possible. For example, differentiation between mismatch
and frameshift errors would enable better comparisons of sequencing
methods with similar total error rates but different frameshift error
rates. Several groups have described efforts to calculate separate
probabilities (or "confidence assessments") for mismatch errors and
frameshift errors (Lawrence and Solovyev 1994; Berno 1996). Their results
demonstrated that different approaches to error type characterization are
feasible and promising. Implementation of such error type predictions in
other programs similar to the way PHRED uses quality scores would enable
better method assessments, benchmarking, and production quality control,
and could have a significant impact on downstream uses of DNA sequence
information.

METHODS 
Data Sets 

For one project, sequence raw data in the form of ABI trace files were
downloaded from a public FTP site. Sequence data for the five other
projects were kindly provided by five different large-scale sequencing
groups. Table 1 gives a summary of the six projects, and Table 2 gives an
overview of the different sequencing methods used in the projects. The
projects differed in the amount of prescreening of data that had been
done, reflecting different approaches to quality control in different
laboratories. In two projects (B and C), different software programs had
been used to identify and eliminate low-quality sequences. One project (F)
included all data files generated, whereas the other three projects had
excluded "failed lanes."

Comparison of Actual and Predicted Error Rates

The sequences for all traces in each project were recalled using the
program PHRED (v. 961028). Next, sequences in each project were assembled
with PHRAP (P. Green, unpubl.). Slightly different methods were chosen for
the statistical and graphical evaluation of the error rate prediction
accuracy. In the statistical evaluation, only the longest contig produced
by PHRAP was considered. The tables of aligned bases and observed
discrepancy counts for each quality score were taken from the PHRAP output
and analyzed [...]. The expected number of discrepancies (E) at each
quality score (q) was calculated by multiplying the number of aligned
bases (N) with the error probability corresponding to the quality score:
$E = N \times 10^{-q/10}$. The Spearman ranking coefficients were
calculated for the expected and observed error frequencies. To obtain the
quantitative relation between the expected and observed error rates over
the entire range, a least-squares fit between the observed and expected
rates was performed, with the intercept set to zero and the number of
aligned bases at each quality score used as weights.
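
The two calculations described above can be sketched in Python as follows.
This is an illustration under stated assumptions, not the original
analysis: the function names are hypothetical, and because the per-score
counts are not reproduced in this text, the fit is applied to the five
quality-score bins of the "All" row of Table 3. The resulting slope is
therefore close to, but not identical with, the value reported in Table 4.

```python
def expected_errors(n_bases, q):
    """Expected number of discrepancies, E = N * 10**(-q/10)."""
    return n_bases * 10 ** (-q / 10)


def weighted_slope_through_origin(predicted, observed, weights):
    """Weighted least-squares slope for observed = slope * predicted.

    The intercept is fixed at zero; minimizing
    sum(w * (y - slope * x)**2) gives slope = sum(w*x*y) / sum(w*x*x).
    """
    num = sum(w * x * y for x, y, w in zip(predicted, observed, weights))
    den = sum(w * x * x for x, w in zip(predicted, weights))
    return num / den


# "All" row of Table 3: (aligned bases, expected errors, actual errors)
# for the score bins 4-12, 13-22, 23-32, 33-42 and 43-51.
bins = [
    (1283087, 220252, 187926),
    (767773, 21513, 18103),
    (739007, 1781, 1441),
    (1447118, 370, 237),
    (543134, 13, 12),
]
predicted = [e / n for n, e, a in bins]   # predicted error rate per bin
observed = [a / n for n, e, a in bins]    # actual error rate per bin
weights = [n for n, e, a in bins]         # number of aligned bases per bin

slope = weighted_slope_through_origin(predicted, observed, weights)
```

As a check of the first function, `expected_errors(10, 5)` gives about
3.16, matching the expected number of errors quoted in the footnote to
Table 4 for 10 bases with a quality score of 5.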

For a graphical comparison of estimated and actual error rates in 50-bp
windows, the following steps were taken. For two of the projects, the
consensus sequence was retrieved from public databases. For the four other
projects, the DNA sequence and quality information were used by the
program PHRAP to assemble consensus sequences for each of the projects.
The individual reads were aligned to the consensus sequences of the
longest contig, using the program CROSS_MATCH (P. Green, unpubl.), after
removing single-coverage regions from the ends of the consensus sequence.
CROSS_MATCH uses an implementation of the Smith-Waterman algorithm to
generate alignments that typically do not include the ends of sequences,
where disagreements are commonly due to vector sequence or low quality
sequence.

The quality files generated by PHRED and the alignment summaries generated
by CROSS_MATCH were then analyzed as follows. First, the region of each
query sequence that had been aligned by CROSS_MATCH was determined. Next,
the actual and predicted error rates for the entire aligned part of each
individual sequence were calculated. In addition, the average actual and
predicted error rates for all alignable sequences together were calculated
for windows of 50 bases in length. To calculate the predicted error rate,
the quality scores q determined by PHRED at each base were converted to
error probabilities as described above (Ewing and Green 1998).
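
The windowed averaging described above can be sketched as follows. This
Python example is a simplified illustration: it assumes that each read is
already available as a list of (quality score, error flag) pairs indexed
by read position, and that the aligned region starts at the first base of
the read. The data structure and function name are assumptions made for
this sketch and do not correspond to PHRED or CROSS_MATCH file formats.

```python
def windowed_error_rates(reads, window=50):
    """Average actual and predicted error rates per window of read positions.

    `reads` is a list of reads; each read is a list of (quality, is_error)
    pairs for its aligned bases, indexed by position in the read.
    Returns a list of (first_base, actual_rate, predicted_rate) tuples,
    with first_base numbered from 1.
    """
    # window index -> [bases, actual errors, summed error probabilities]
    totals = {}
    for read in reads:
        for pos, (q, is_error) in enumerate(read):
            t = totals.setdefault(pos // window, [0, 0, 0.0])
            t[0] += 1
            t[1] += 1 if is_error else 0
            t[2] += 10 ** (-q / 10)
    return [
        (w * window + 1, t[1] / t[0], t[2] / t[0])
        for w, t in sorted(totals.items())
    ]


# Hypothetical toy data: two reads of 100 bases, uniformly quality 20,
# with one actual error each (at read positions 10 and 60).
read1 = [(20, i == 10) for i in range(100)]
read2 = [(20, i == 60) for i in range(100)]
rates = windowed_error_rates([read1, read2])
```

Each tuple in `rates` carries the first base of its window, which is the
quantity plotted on the x-axis of Figure 1.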

Subdividing Data Into Subsets Based on Data Quality 

To examine the accuracy of PHRED quality scores for data subsets of
different quality within a project, the following approach was taken. For
all sequence reads in project B, the number of bases with a quality score
of at least 30 in each sequence was determined (bases with quality scores
of at least 30 were called very high-quality bases, or VHQ bases).
Sequences were sorted in descending order based on the number of very
high-quality bases, and divided into four quartiles. Accordingly, quartile
1 contained 25% of sequences with the highest number of very high-quality
bases, and quartile 4 contained the "worst" sequences. To illustrate the
prediction accuracy in data with relatively high error rates, sequences
from project B that had been "discarded" because they had not met the
minimum quality criteria were added back to the data set. The sequences in
each quartile were compared to the consensus sequences that had been
generated using the entire data set, as described above for the graphical
comparison.
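
The sorting procedure can be sketched in a few lines of Python. The read
names, the dictionary input, and the handling of read counts that are not
divisible by four are assumptions made for this illustration.

```python
def vhq_count(qualities, threshold=30):
    """Number of bases with quality score >= threshold (VHQ bases for 30)."""
    return sum(1 for q in qualities if q >= threshold)


def split_into_quartiles(reads):
    """Sort reads by descending VHQ count and split them into four groups.

    `reads` maps a read name to its list of quality scores. Quartile 1
    (index 0) holds the reads with the most VHQ bases. Group sizes differ
    by at most one read when the total is not divisible by four.
    """
    ranked = sorted(reads, key=lambda name: vhq_count(reads[name]),
                    reverse=True)
    n = len(ranked)
    bounds = [i * n // 4 for i in range(5)]
    return [ranked[bounds[i]:bounds[i + 1]] for i in range(4)]


# Hypothetical reads with 8 bases each.
reads = {
    "r1": [35] * 8,
    "r2": [35] * 6 + [10] * 2,
    "r3": [35] * 3 + [10] * 5,
    "r4": [10] * 8,
}
quartiles = split_into_quartiles(reads)
```

The same function could be used with a different `threshold` in
`vhq_count` to explore other quality cutoffs.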

Determining Actual Frameshift Error Rates 

The calculation of actual frameshift error rates in the raw sequence data
was performed using CROSS_MATCH, similar to the procedure described above
for total error rates, except that only insertion and deletion errors were
counted. Because PHRED does not give separate frameshift error estimates,
a comparison of predicted and actual frameshift errors is not possible.